commit 486088bc Author: Linus Torvalds

Merge tag 'standardize-docs' of git://git.lwn.net/linux

Pull documentation format standardization from Jonathan Corbet:
 "This series converts a number of top-level documents to the RST format
  without incorporating them into the Sphinx tree. The hope is to bring
  some uniformity to kernel documentation and, perhaps more importantly,
  have our existing docs serve as an example of the desired formatting
  for those that will be added later.

  Mauro has gone through and fixed up a lot of top-level documentation
  files to make them conform to the RST format, but without moving or
  renaming them in any way. This will help when we incorporate the ones
  we want to keep into the Sphinx doctree, but the real purpose is to
  bring a bit of uniformity to our documentation and let the top-level
  docs serve as examples for those writing new ones"

* tag 'standardize-docs' of git://git.lwn.net/linux: (84 commits)
  docs: kprobes.txt: Fix whitespacing
  tee.txt: standardize document format
  cgroup-v2.txt: standardize document format
  dell_rbu.txt: standardize document format
  zorro.txt: standardize document format
  xz.txt: standardize document format
  xillybus.txt: standardize document format
  vfio.txt: standardize document format
  vfio-mediated-device.txt: standardize document format
  unaligned-memory-access.txt: standardize document format
  this_cpu_ops.txt: standardize document format
  svga.txt: standardize document format
  static-keys.txt: standardize document format
  smsc_ece1099.txt: standardize document format
  SM501.txt: standardize document format
  siphash.txt: standardize document format
  sgi-ioc4.txt: standardize document format
  SAK.txt: standardize document format
  rpmsg.txt: standardize document format
  robust-futexes.txt: standardize document format
  ...
=========================
Dynamic DMA mapping Guide
=========================

:Author: David S. Miller <davem@redhat.com>
:Author: Richard Henderson <rth@cygnus.com>
:Author: Jakub Jelinek <jakub@redhat.com>
This is a guide to device driver writers on how to use the DMA API
with example pseudo-code.  For a concise description of the API, see
DMA-API.txt.

CPU and DMA addresses
=====================

There are several kinds of addresses involved in the DMA API, and it's
important to understand the differences.

The kernel normally uses virtual addresses.  Any address returned by
kmalloc(), vmalloc(), and similar interfaces is a virtual address and can
be stored in a ``void *``.

The virtual memory system (TLB, page tables, etc.) translates virtual
addresses to CPU physical addresses, which are stored as "phys_addr_t" or
@@ -37,7 +39,7 @@ be restricted to a subset of that space. For example, even if a system
supports 64-bit addresses for main memory and PCI BARs, it may use an IOMMU
so devices only need to use 32-bit DMA addresses.

Here's a picture and some examples::

               CPU                  CPU                  Bus
             Virtual              Physical             Address
@@ -98,15 +100,16 @@ microprocessor architecture. You should use the DMA API rather than the
bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the
pci_map_*() interfaces.

First of all, you should make sure::

	#include <linux/dma-mapping.h>

is in your driver, which provides the definition of dma_addr_t.  This type
can hold any valid DMA address for the platform and should be used
everywhere you hold a DMA address returned from the DMA mapping functions.
What memory is DMA'able?
========================

The first piece of information you must know is what kernel memory can
be used with the DMA mapping facilities.  There has been an unwritten

@@ -143,7 +146,8 @@ What about block I/O and networking buffers? The block I/O and
networking subsystems make sure that the buffers they use are valid
for you to DMA from/to.
DMA addressing limitations
==========================

Does your device have any DMA addressing limitations?  For example, is
your device only capable of driving the low order 24-bits of address?

@@ -166,7 +170,7 @@ style to do this even if your device holds the default setting,
because this shows that you did think about these issues wrt. your
device.

The query is performed via a call to dma_set_mask_and_coherent()::

	int dma_set_mask_and_coherent(struct device *dev, u64 mask);
@@ -175,12 +179,12 @@ If you have some special requirements, then the following two separate
queries can be used instead:

The query for streaming mappings is performed via a call to
dma_set_mask()::

	int dma_set_mask(struct device *dev, u64 mask);

The query for consistent allocations is performed via a call
to dma_set_coherent_mask()::

	int dma_set_coherent_mask(struct device *dev, u64 mask);
@@ -209,7 +213,7 @@ of your driver reports that performance is bad or that the device is not
even detected, you can ask them for the kernel messages to find out
exactly why.

The standard 32-bit addressing device would do something like this::

	if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32))) {
		dev_warn(dev, "mydev: No suitable DMA available\n");
@@ -225,7 +229,7 @@ than 64-bit addressing. For example, Sparc64 PCI SAC addressing is
more efficient than DAC addressing.

Here is how you would handle a 64-bit capable device which can drive
all 64-bits when accessing streaming DMA::

	int using_dac;

@@ -239,7 +243,7 @@ all 64-bits when accessing streaming DMA:
	}

If a card is capable of using 64-bit consistent allocations as well,
the case would look like this::

	int using_dac, consistent_using_dac;
@@ -260,7 +264,7 @@ uses consistent allocations, one would have to check the return value from
dma_set_coherent_mask().

Finally, if your device can only drive the low 24-bits of
address you might do something like::

	if (dma_set_mask(dev, DMA_BIT_MASK(24))) {
		dev_warn(dev, "mydev: 24-bit DMA addressing not available\n");
@@ -280,7 +284,7 @@ only provide the functionality which the machine can handle. It
is important that the last call to dma_set_mask() be for the
most specific mask.

Here is pseudo-code showing how this might be done::

	#define PLAYBACK_ADDRESS_BITS	DMA_BIT_MASK(32)
	#define RECORD_ADDRESS_BITS	DMA_BIT_MASK(24)

@@ -308,7 +312,8 @@ A sound card was used as an example here because this genre of PCI
devices seems to be littered with ISA chips given a PCI front end,
and thus retaining the 16MB DMA addressing limitations of ISA.
Types of DMA mappings
=====================

There are two types of DMA mappings:

@@ -336,12 +341,14 @@ There are two types of DMA mappings:
	  to memory is immediately visible to the device, and vice
	  versa.  Consistent mappings guarantee this.
	  .. important::

	     Consistent DMA memory does not preclude the usage of
	     proper memory barriers.  The CPU may reorder stores to
	     consistent memory just as it may normal memory.  Example:
	     if it is important for the device to see the first word
	     of a descriptor updated before the second, you must do
	     something like::

		desc->word0 = address;
		wmb();
@@ -377,16 +384,17 @@ Also, systems with caches that aren't DMA-coherent will work better
when the underlying buffers don't share cache lines with other data.

Using Consistent DMA mappings
=============================

To allocate and map large (PAGE_SIZE or so) consistent DMA regions,
you should do::

	dma_addr_t dma_handle;

	cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp);

where device is a ``struct device *``.  This may be called in interrupt
context with the GFP_ATOMIC flag.
Size is the length of the region you want to allocate, in bytes.

@@ -415,7 +423,7 @@ exists (for example) to guarantee that if you allocate a chunk
which is smaller than or equal to 64 kilobytes, the extent of the
buffer you receive will not cross a 64K boundary.
To unmap and free such a DMA region, you call::

	dma_free_coherent(dev, size, cpu_addr, dma_handle);
@@ -430,7 +438,7 @@ a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages().
Also, it understands common hardware constraints for alignment,
like queue heads needing to be aligned on N byte boundaries.

Create a dma_pool like this::

	struct dma_pool *pool;
@@ -444,7 +452,7 @@ pass 0 for boundary; passing 4096 says memory allocated from this pool
must not cross 4KByte boundaries (but at that time it may be better to
use dma_alloc_coherent() directly instead).

Allocate memory from a DMA pool like this::

	cpu_addr = dma_pool_alloc(pool, flags, &dma_handle);

@@ -452,7 +460,7 @@ flags are GFP_KERNEL if blocking is permitted (not in_interrupt nor
holding SMP locks), GFP_ATOMIC otherwise.  Like dma_alloc_coherent(),
this returns two values, cpu_addr and dma_handle.

Free memory that was allocated from a dma_pool like this::

	dma_pool_free(pool, cpu_addr, dma_handle);
@@ -460,7 +468,7 @@ where pool is what you passed to dma_pool_alloc(), and cpu_addr and
dma_handle are the values dma_pool_alloc() returned.  This function
may be called in interrupt context.

Destroy a dma_pool by calling::

	dma_pool_destroy(pool);

@@ -468,11 +476,12 @@ Make sure you've called dma_pool_free() for all memory allocated
from a pool before you destroy the pool.  This function may not
be called in interrupt context.
DMA Direction
=============

The interfaces described in subsequent portions of this document
take a DMA direction argument, which is an integer and takes on
one of the following values::

	DMA_BIDIRECTIONAL
	DMA_TO_DEVICE

@@ -521,14 +530,15 @@ packets, map/unmap them with the DMA_TO_DEVICE direction
specifier.  For receive packets, just the opposite, map/unmap them
with the DMA_FROM_DEVICE direction specifier.
Using Streaming DMA mappings
============================

The streaming DMA mapping routines can be called from interrupt
context.  There are two versions of each map/unmap, one which will
map/unmap a single memory region, and one which will map/unmap a
scatterlist.

To map a single region, you do::

	struct device *dev = &my_dev->dev;
	dma_addr_t dma_handle;

@@ -545,7 +555,7 @@ To map a single region, you do:
		goto map_error_handling;
	}

and to unmap it::

	dma_unmap_single(dev, dma_handle, size, direction);
@@ -563,7 +573,7 @@ Using CPU pointers like this for single mappings has a disadvantage:
you cannot reference HIGHMEM memory in this way.  Thus, there is a
map/unmap interface pair akin to dma_{map,unmap}_single().  These
interfaces deal with page/offset pairs instead of CPU pointers.
Specifically::

	struct device *dev = &my_dev->dev;
	dma_addr_t dma_handle;
@@ -593,7 +603,7 @@ error as outlined under the dma_map_single() discussion.

You should call dma_unmap_page() when the DMA activity is finished, e.g.,
from the interrupt which told you that the DMA transfer is done.

With scatterlists, you map a region gathered from several regions by::

	int i, count = dma_map_sg(dev, sglist, nents, direction);
	struct scatterlist *sg;

@@ -617,16 +627,18 @@ Then you should loop count times (note: this can be less than nents times)
and use sg_dma_address() and sg_dma_len() macros where you previously
accessed sg->address and sg->length as shown above.

To unmap a scatterlist, just call::

	dma_unmap_sg(dev, sglist, nents, direction);

Again, make sure DMA activity has already finished.
.. note::

	The 'nents' argument to the dma_unmap_sg call must be
	the _same_ one you passed into the dma_map_sg call,
	it should _NOT_ be the 'count' value _returned_ from the
	dma_map_sg call.
Every dma_map_{single,sg}() call should have its dma_unmap_{single,sg}()
counterpart, because the DMA address space is a shared resource and

@@ -638,11 +650,11 @@ properly in order for the CPU and device to see the most up-to-date and
correct copy of the DMA buffer.

So, firstly, just map it with dma_map_{single,sg}(), and after each DMA
transfer call either::

	dma_sync_single_for_cpu(dev, dma_handle, size, direction);

or::

	dma_sync_sg_for_cpu(dev, sglist, nents, direction);
@@ -650,17 +662,19 @@ as appropriate.

Then, if you wish to let the device get at the DMA area again,
finish accessing the data with the CPU, and then before actually
giving the buffer to the hardware call either::

	dma_sync_single_for_device(dev, dma_handle, size, direction);

or::

	dma_sync_sg_for_device(dev, sglist, nents, direction);

as appropriate.

.. note::

	The 'nents' argument to dma_sync_sg_for_cpu() and
	dma_sync_sg_for_device() must be the same passed to
	dma_map_sg().  It is _NOT_ the count returned by
	dma_map_sg().
@@ -671,7 +685,7 @@ dma_map_*() call till dma_unmap_*(), then you don't have to call the
dma_sync_*() routines at all.

Here is pseudo code which shows a situation in which you would need
to use the dma_sync_*() interfaces::

	my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len)
	{
@@ -747,7 +761,8 @@ is planned to completely remove virt_to_bus() and bus_to_virt() as
they are entirely deprecated.  Some ports already do not provide these
as it is impossible to correctly support them.

Handling Errors
===============

DMA address space is limited on some architectures and an allocation
failure can be determined by:
@@ -755,7 +770,7 @@ failure can be determined by:

- checking if dma_alloc_coherent() returns NULL or dma_map_sg returns 0

- checking the dma_addr_t returned from dma_map_single() and dma_map_page()
  by using dma_mapping_error()::

	dma_addr_t dma_handle;
@@ -773,7 +788,8 @@ failure can be determined by:

of a multiple page mapping attempt.  These examples are applicable to
dma_map_page() as well.

Example 1::

	dma_addr_t dma_handle1;
	dma_addr_t dma_handle2;

@@ -802,8 +818,12 @@ Example 1:
	dma_unmap_single(dma_handle1);

	map_error_handling1:
Example 2::

	/*
	 * if buffers are allocated in a loop, unmap all mapped buffers when
	 * mapping error is detected in the middle
	 */

	dma_addr_t dma_addr;
	dma_addr_t array[DMA_BUFFERS];
@@ -846,7 +866,8 @@ SCSI drivers must return SCSI_MLQUEUE_HOST_BUSY if the DMA mapping
fails in the queuecommand hook.  This means that the SCSI subsystem
passes the command to the driver again later.

Optimizing Unmap State Space Consumption
========================================
On many platforms, dma_unmap_{single,page}() is simply a nop.
Therefore, keeping track of the mapping address and length is a waste

@@ -858,7 +879,7 @@ Actually, instead of describing the macros one by one, we'll
transform some example code.

1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures.
   Example, before::

	struct ring_state {
		struct sk_buff *skb;

@@ -866,7 +887,7 @@ transform some example code.
		__u32 len;
	};

   after::

	struct ring_state {
		struct sk_buff *skb;
@@ -875,23 +896,23 @@ transform some example code.
	};

2) Use dma_unmap_{addr,len}_set() to set these values.
   Example, before::

	ringp->mapping = FOO;
	ringp->len = BAR;

   after::

	dma_unmap_addr_set(ringp, mapping, FOO);
	dma_unmap_len_set(ringp, len, BAR);

3) Use dma_unmap_{addr,len}() to access these values.
   Example, before::

	dma_unmap_single(dev, ringp->mapping, ringp->len,
			 DMA_FROM_DEVICE);

   after::

	dma_unmap_single(dev,
			 dma_unmap_addr(ringp, mapping),
@@ -902,7 +923,8 @@ It really should be self-explanatory. We treat the ADDR and LEN
separately, because it is possible for an implementation to only
need the address in order to perform the unmap operation.
Platform Issues
===============
If you are just writing drivers for Linux and do not maintain
an architecture port for the kernel, you can safely skip down
to "Closing".

@@ -928,12 +950,13 @@ to "Closing".
   alignment constraints (e.g. the alignment constraints about 64-bit
   objects).

Closing
=======
This document, and the API itself, would not be in its current
form without the feedback and suggestions from numerous individuals.
We would like to specifically mention, in no particular order, the
following people::

	Russell King <rmk@arm.linux.org.uk>
	Leo Dagum <dagum@barrel.engr.sgi.com>
============================
DMA with ISA and LPC devices
============================

:Author: Pierre Ossman <drzeus@drzeus.cx>
This document describes how to do DMA transfers using the old ISA DMA
controller.  Even though ISA is more or less dead today the LPC bus
uses the same DMA system so it will be around for quite some time.

Headers and dependencies
------------------------

To do ISA style DMA you need to include two headers::

	#include <linux/dma-mapping.h>
	#include <asm/dma.h>

The first is the generic DMA API used to convert virtual addresses to
bus addresses (see Documentation/DMA-API.txt for details).
@@ -23,8 +24,8 @@ this is not present on all platforms make sure you construct your
Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries
to build your driver on unsupported platforms.

Buffer allocation
-----------------

The ISA DMA controller has some very strict requirements on which
memory it can access so extra care must be taken when allocating
@@ -47,8 +48,8 @@ __GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder.

(This scarcity also means that you should allocate the buffer as
early as possible and not release it until the driver is unloaded.)

Address translation
-------------------

To translate the virtual address to a bus address, use the normal DMA
API.  Do _not_ use isa_virt_to_phys() even though it does the same
...@@ -61,8 +62,8 @@ Note: x86_64 had a broken DMA API when it came to ISA but has since ...@@ -61,8 +62,8 @@ Note: x86_64 had a broken DMA API when it came to ISA but has since
been fixed. If your arch has problems then fix the DMA API instead of been fixed. If your arch has problems then fix the DMA API instead of
reverting to the ISA functions. reverting to the ISA functions.
Channels
--------

A normal ISA DMA controller has 8 channels.  The lower four are for
8-bit transfers and the upper four are for 16-bit transfers.

@@ -80,8 +81,8 @@
The ability to use 16-bit or 8-bit transfers is _not_ up to you as a
driver author but depends on what the hardware supports.  Check your
specs or test different channels.
Transfer data
-------------

Now for the good stuff, the actual DMA transfer. :)

@@ -112,37 +113,37 @@
Once the DMA transfer is finished (or timed out) you should disable
the channel again.  You should also check get_dma_residue() to make
sure that all data has been transferred.

Example::

    int flags, residue;

    flags = claim_dma_lock();
    clear_dma_ff();
    set_dma_mode(channel, DMA_MODE_WRITE);
    set_dma_addr(channel, phys_addr);
    set_dma_count(channel, num_bytes);
    dma_enable(channel);
    release_dma_lock(flags);

    while (!device_done());

    flags = claim_dma_lock();
    dma_disable(channel);
    residue = dma_get_residue(channel);
    if (residue != 0)
            printk(KERN_ERR "driver: Incomplete DMA transfer!"
                    " %d bytes left!\n", residue);
    release_dma_lock(flags);
Suspend/resume
--------------

It is the driver's responsibility to make sure that the machine isn't
suspended while a DMA transfer is in progress.  Also, all DMA settings
......
==============
DMA attributes
==============

This document describes the semantics of the DMA attributes that are
defined in linux/dma-mapping.h.

@@ -108,6 +109,7 @@
This is a hint to the DMA-mapping subsystem that it's probably not worth
the time to try to allocate memory in a way that gives better TLB
efficiency (AKA it's not worth trying to build the mapping out of larger
pages).  You might want to specify this if:

- You know that the accesses to this memory won't thrash the TLB.
  You might know that the accesses are likely to be sequential or
  that they aren't sequential but it's unlikely you'll ping-pong

@@ -121,11 +123,12 @@
the mapping to have a short lifetime then it may be worth it to
optimize allocation (avoid coming up with large pages) instead of
getting the slight performance win of larger pages.

Setting this hint doesn't guarantee that you won't get huge pages, but it
means that we won't try quite as hard to get them.

.. note:: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM,
   though ARM64 patches will likely be posted soon.
DMA_ATTR_NO_WARN
----------------

@@ -142,10 +145,10 @@
problem at all, depending on the implementation of the retry mechanism.

So, this provides a way for drivers to avoid those error messages on calls
where allocation failures are not a problem, and shouldn't bother the logs.

.. note:: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC.

DMA_ATTR_PRIVILEGED
-------------------

Some advanced peripherals such as remote processors and GPUs perform
accesses to DMA buffers in both privileged "supervisor" and unprivileged
......
=====================
The Linux IPMI Driver
=====================
:Author: Corey Minyard <minyard@mvista.com> / <minyard@acm.org>

The Intelligent Platform Management Interface, or IPMI, is a
standard for controlling intelligent devices that monitor a system.

@@ -141,7 +140,7 @@
Addressing
----------

IPMI addressing works much like IP addresses: you have an overlay
to handle the different address types.  The overlay is::
    struct ipmi_addr
    {

@@ -153,7 +152,7 @@
The addr_type determines what the address really is.  The driver
currently understands two different types of addresses.

"System Interface" addresses are defined as::

    struct ipmi_system_interface_addr
    {

@@ -166,7 +165,7 @@
straight to the BMC on the current card.  The channel must be
IPMI_BMC_CHANNEL.
Messages that are destined to go out on the IPMB bus use the
IPMI_IPMB_ADDR_TYPE address type.  The format is::

    struct ipmi_ipmb_addr
    {

@@ -184,16 +183,16 @@
spec.
Messages
--------

Messages are defined as::

    struct ipmi_msg
    {
            unsigned char netfn;
            unsigned char lun;
            unsigned char cmd;
            unsigned char *data;
            int           data_len;
    };

The driver takes care of adding/stripping the header information.  The
data portion is just the data to be sent (do NOT put addressing info

@@ -208,7 +207,7 @@
block of data, even when receiving messages.  Otherwise the driver
will have no place to put the message.
Messages coming up from the message handler in kernelland will come in
as::

    struct ipmi_recv_msg
    {

@@ -246,6 +245,7 @@
and the user should not have to care what type of SMI is below them.
Watching For Interfaces
^^^^^^^^^^^^^^^^^^^^^^^

When your code comes up, the IPMI driver may or may not have detected
if IPMI devices exist.  So you might have to defer your setup until

@@ -256,6 +256,7 @@
and tell you when they come and go.
Creating the User
^^^^^^^^^^^^^^^^^

To use the message handler, you must first create a user using
ipmi_create_user.  The interface number specifies which SMI you want

@@ -272,6 +273,7 @@
closing the device automatically destroys the user.
Messaging
^^^^^^^^^

To send a message from kernel-land, the ipmi_request_settime() call does
pretty much all message handling.  Most of the parameters are

@@ -321,6 +323,7 @@
though, since it is tricky to manage your own buffers.
Events and Incoming Commands
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The driver takes care of polling for IPMI events and receiving
commands (commands are messages that are not responses, they are

@@ -367,7 +370,7 @@
in the system.  It discovers interfaces through a host of different
methods, depending on the system.
You can specify up to four interfaces on the module load line and
control some module parameters::

    modprobe ipmi_si.o type=<type1>,<type2>....
           ports=<port1>,<port2>... addrs=<addr1>,<addr2>...

@@ -437,7 +440,7 @@
default is one.  Setting to 0 is useful with the hotmod, but is
obviously only useful for modules.
When compiled into the kernel, the parameters can be specified on the
kernel command line as::

    ipmi_si.type=<type1>,<type2>...
           ipmi_si.ports=<port1>,<port2>... ipmi_si.addrs=<addr1>,<addr2>...
@@ -474,16 +477,22 @@
The driver supports a hot add and remove of interfaces.  This way,
interfaces can be added or removed after the kernel is up and running.
This is done using /sys/modules/ipmi_si/parameters/hotmod, which is a
write-only parameter.  You write a string to this interface.  The string
has the format::

    <op1>[:op2[:op3...]]

The "op"s are::

    add|remove,kcs|bt|smic,mem|i/o,<address>[,<opt1>[,<opt2>[,...]]]

You can specify more than one interface on the line.  The "opt"s are::

    rsp=<regspacing>
    rsi=<regsize>
    rsh=<regshift>
    irq=<irq>
    ipmb=<ipmb slave addr>

and these have the same meanings as discussed above.  Note that you
can also use this on the kernel command line for a more compact format
for specifying an interface.  Note that when removing an interface,
@@ -496,7 +505,7 @@
The SMBus Driver (SSIF)

The SMBus driver allows up to 4 SMBus devices to be configured in the
system.  By default, the driver will only register with something it
finds in DMI or ACPI tables.  You can change this
at module load time (for a module) with::

    modprobe ipmi_ssif.o
           addr=<i2caddr1>[,<i2caddr2>[,...]]
@@ -535,7 +544,7 @@
the smb_addr parameter unless you have DMI or ACPI data to tell the
driver what to use.

When compiled into the kernel, the addresses can be specified on the
kernel command line as::

    ipmi_ssif.addr=<i2caddr1>[,<i2caddr2>[...]]
    ipmi_ssif.adapter=<adapter1>[,<adapter2>[...]]
@@ -565,9 +574,9 @@
Some users need more detailed information about a device, like where
the address came from or the raw base device for the IPMI interface.
You can use the IPMI smi_watcher to catch the IPMI interfaces as they
come or go, and to grab the information, you can use the function
ipmi_get_smi_info(), which returns the following structure::

    struct ipmi_smi_info {
            enum ipmi_addr_src addr_src;
            struct device *dev;
            union {

@@ -575,7 +584,7 @@
                            void *acpi_handle;
                    } acpi_info;
            } addr_info;
    };
Currently, special info is returned only for SI_ACPI address sources.
Others may be added as necessary.

@@ -590,7 +599,7 @@
Watchdog

A watchdog timer is provided that implements the Linux-standard
watchdog timer interface.  It has three module parameters that can be
used to control it::

    modprobe ipmi_watchdog timeout=<t> pretimeout=<t> action=<action type>
        preaction=<preaction type> preop=<preop type> start_now=x
@@ -635,7 +644,7 @@
watchdog device is closed.  The default value of nowayout is true
if the CONFIG_WATCHDOG_NOWAYOUT option is enabled, or false if not.

When compiled into the kernel, the kernel command line is available
for configuring the watchdog::

    ipmi_watchdog.timeout=<t> ipmi_watchdog.pretimeout=<t>
    ipmi_watchdog.action=<action type>
@@ -675,6 +684,7 @@
also get a bunch of OEM events holding the panic string.

The field settings of the events are:

* Generator ID: 0x21 (kernel)
* EvM Rev: 0x03 (this event is formatted in IPMI 1.0 format)
* Sensor Type: 0x20 (OS critical stop sensor)

@@ -683,18 +693,20 @@
* Event Data 1: 0xa1 (Runtime stop in OEM bytes 2 and 3)
* Event data 2: second byte of panic string
* Event data 3: third byte of panic string

See the IPMI spec for the details of the event layout.  This event is
always sent to the local management controller.  It will handle routing
the message to the right place.
Other OEM events have the following format:

* Record ID (bytes 0-1): Set by the SEL.
* Record type (byte 2): 0xf0 (OEM non-timestamped)
* byte 3: The slave address of the card saving the panic
* byte 4: A sequence number (starting at zero)

The rest of the bytes (11 bytes) are the panic string.  If the panic string
is longer than 11 bytes, multiple messages will be sent with increasing
sequence numbers.

Because you cannot send OEM events using the standard interface, this
function will attempt to find an SEL and add the events there.  It
......
================
SMP IRQ affinity
================
ChangeLog:
	- Started by Ingo Molnar <mingo@redhat.com>
	- Update by Max Krasnyansky <maxk@qualcomm.com>

/proc/irq/IRQ#/smp_affinity and /proc/irq/IRQ#/smp_affinity_list specify
which target CPUs are permitted for a given IRQ source.  It's a bitmask

@@ -16,50 +19,52 @@
will be set to the default mask. It can then be changed as described above.
Default mask is 0xffffffff.

Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
it to CPU4-7 (this is an 8-CPU SMP box)::
    [root@moon 44]# cd /proc/irq/44
    [root@moon 44]# cat smp_affinity
    ffffffff
    [root@moon 44]# echo 0f > smp_affinity
    [root@moon 44]# cat smp_affinity
    0000000f
    [root@moon 44]# ping -f h
    PING hell (195.4.7.3): 56 data bytes
    ...
    --- hell ping statistics ---
    6029 packets transmitted, 6027 packets received, 0% packet loss
    round-trip min/avg/max = 0.1/0.1/0.4 ms
    [root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
             CPU0       CPU1       CPU2       CPU3      CPU4       CPU5        CPU6       CPU7
     44:     1068       1785       1785       1783         0          0           0          0    IO-APIC-level  eth1

As can be seen from the line above IRQ44 was delivered only to the first four
processors (0-3).

Now let's restrict that IRQ to CPU(4-7).
::

    [root@moon 44]# echo f0 > smp_affinity
    [root@moon 44]# cat smp_affinity
    000000f0
    [root@moon 44]# ping -f h
    PING hell (195.4.7.3): 56 data bytes
    ..
    --- hell ping statistics ---
    2779 packets transmitted, 2777 packets received, 0% packet loss
    round-trip min/avg/max = 0.1/0.5/585.4 ms
    [root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
             CPU0       CPU1       CPU2       CPU3      CPU4       CPU5        CPU6       CPU7
     44:     1068       1785       1785       1783      1784       1069       1070       1069   IO-APIC-level  eth1

This time around IRQ44 was delivered only to the last four processors,
i.e. the counters for CPU0-3 did not change.
Here is an example of limiting that same irq (44) to cpus 1024 to 1031::

    [root@moon 44]# echo 1024-1031 > smp_affinity_list
    [root@moon 44]# cat smp_affinity_list
    1024-1031

Note that to do this with a bitmask would require 32 bitmasks of zero
to follow the pertinent one.
===============================================
The irq_domain interrupt number mapping library
===============================================

The current design of the Linux kernel uses a single large number
space where each separate IRQ source is assigned a different number.
@@ -36,7 +38,9 @@
irq_domain also implements translation from an abstract irq_fwspec
structure to hwirq numbers (Device Tree and ACPI GSI so far), and can
be easily extended to support other IRQ topology data sources.

irq_domain usage
================

An interrupt controller driver creates and registers an irq_domain by
calling one of the irq_domain_add_*() functions (each mapping method
has a different allocator function, more on that later).  The function

@@ -62,15 +66,21 @@
If the driver has the Linux IRQ number or the irq_data pointer, and
needs to know the associated hwirq number (such as in the irq_chip
callbacks) then it can be directly obtained from irq_data->hwirq.
Types of irq_domain mappings
============================

There are several mechanisms available for reverse mapping from hwirq
to Linux irq, and each mechanism uses a different allocation function.
Which reverse map type should be used depends on the use case.  Each
of the reverse map types are described below:

Linear
------

::

    irq_domain_add_linear()
    irq_domain_create_linear()

The linear reverse map maintains a fixed size table indexed by the
hwirq number.  When a hwirq is mapped, an irq_desc is allocated for

@@ -89,9 +99,13 @@
accepts a more general abstraction 'struct fwnode_handle'.

The majority of drivers should use the linear map.
Tree
----

::

    irq_domain_add_tree()
    irq_domain_create_tree()

The irq_domain maintains a radix tree map from hwirq numbers to Linux
IRQs.  When an hwirq is mapped, an irq_desc is allocated and the

@@ -109,8 +123,12 @@
accepts a more general abstraction 'struct fwnode_handle'.

Very few drivers should need this mapping.
No Map
------

::

    irq_domain_add_nomap()

The No Map mapping is to be used when the hwirq number is
programmable in the hardware.  In this case it is best to program the

@@ -121,10 +139,14 @@
Linux IRQ number into the hardware.

Most drivers cannot use this mapping.
Legacy
------

::

    irq_domain_add_simple()
    irq_domain_add_legacy()
    irq_domain_add_legacy_isa()

The Legacy mapping is a special case for drivers that already have a
range of irq_descs allocated for the hwirqs.  It is used when the

@@ -163,14 +185,17 @@
that the driver using the simple domain call irq_create_mapping()
before any irq_find_mapping() since the latter will actually work
for the static IRQ assignment case.
Hierarchy IRQ domain
--------------------

On some architectures, there may be multiple interrupt controllers
involved in delivering an interrupt from the device to the target CPU.
Let's look at a typical interrupt delivering path on x86 platforms::

    Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU

There are three interrupt controllers involved:

1) IOAPIC controller
2) Interrupt remapping controller
3) Local APIC controller
@@ -180,7 +205,8 @@
hardware architecture, an irq_domain data structure is built for each
interrupt controller and those irq_domains are organized into hierarchy.
When building irq_domain hierarchy, the irq_domain near to the device is
child and the irq_domain near to CPU is parent.  So a hierarchy structure
as below will be built for the example above::

    CPU Vector irq_domain (root irq_domain to manage CPU vectors)
            ^
            |

@@ -190,6 +216,7 @@
    IOAPIC irq_domain (manage IOAPIC delivery entries/pins)
There are four major interfaces to use hierarchy irq_domain:

1) irq_domain_alloc_irqs(): allocate IRQ descriptors and interrupt
   controller related resources to deliver these interrupts.
2) irq_domain_free_irqs(): free IRQ descriptors and interrupt controller

@@ -199,7 +226,8 @@
4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware
   to stop delivering the interrupt.

The following changes are needed to support hierarchy irq_domain:

1) a new field 'parent' is added to struct irq_domain; it's used to
   maintain irq_domain hierarchy information.
2) a new field 'parent_data' is added to struct irq_data; it's used to
...@@ -223,6 +251,7 @@ software architecture. ...@@ -223,6 +251,7 @@ software architecture.
For an interrupt controller driver to support hierarchy irq_domain, it For an interrupt controller driver to support hierarchy irq_domain, it
needs to: needs to:
1) Implement irq_domain_ops.alloc and irq_domain_ops.free 1) Implement irq_domain_ops.alloc and irq_domain_ops.free
2) Optionally implement irq_domain_ops.activate and 2) Optionally implement irq_domain_ops.activate and
irq_domain_ops.deactivate. irq_domain_ops.deactivate.
......
===============
What is an IRQ?
===============

An IRQ is an interrupt request from a device.
Currently they can come in over a pin, or over a packet.
......
===================
Linux IOMMU Support
===================

...@@ -9,11 +10,11 @@ This guide gives a quick cheat sheet for some basic understanding.

Some Keywords

- DMAR - DMA remapping
- DRHD - DMA Remapping Hardware Unit Definition
- RMRR - Reserved memory Region Reporting Structure
- ZLR - Zero length reads from PCI devices
- IOVA - IO Virtual address.

Basic stuff
-----------

...@@ -33,7 +34,7 @@ devices that need to access these regions. OS is expected to setup

unity mappings for these regions for these devices to access these regions.

How is IOVA generated?
----------------------

Well behaved drivers call pci_map_*() calls before sending command to device
that needs to perform DMA. Once DMA is completed and mapping is no longer

...@@ -82,14 +83,14 @@ in ACPI.

ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0

When DMAR is being processed and initialized by ACPI, prints DMAR locations
and any RMRR's processed::

	ACPI DMAR:Host address width 36
	ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
	ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
	ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
	ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
	ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff

When DMAR is enabled for use, you will notice..

...@@ -98,10 +99,12 @@ PCI-DMA: Using DMAR IOMMU

Fault reporting
---------------

::

	DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
	DMAR:[fault reason 05] PTE Write access is not set

TBD
----
......
=========================================
Linux Secure Attention Key (SAK) handling
=========================================

:Date: 18 March 2001
:Author: Andrew Morton

An operating system's Secure Attention Key is a security tool which is
provided as protection against trojan password capturing programs. It

...@@ -13,7 +17,7 @@ this sequence. It is only available if the kernel was compiled with

sysrq support.

The proper way of generating a SAK is to define the key sequence using
``loadkeys``. This will work whether or not sysrq support is compiled
into the kernel.

SAK works correctly when the keyboard is in raw mode. This means that

...@@ -25,64 +29,63 @@ What key sequence should you use? Well, CTRL-ALT-DEL is used to reboot

the machine. CTRL-ALT-BACKSPACE is magical to the X server. We'll
choose CTRL-ALT-PAUSE.

In your rc.sysinit (or rc.local) file, add the command::

	echo "control alt keycode 101 = SAK" | /bin/loadkeys

And that's it! Only the superuser may reprogram the SAK key.

.. note::

  1. Linux SAK is said to be not a "true SAK" as is required by
     systems which implement C2 level security. This author does not
     know why.

  2. On the PC keyboard, SAK kills all applications which have
     /dev/console opened.

     Unfortunately this includes a number of things which you don't
     actually want killed. This is because these applications are
     incorrectly holding /dev/console open. Be sure to complain to your
     Linux distributor about this!

     You can identify processes which will be killed by SAK with the
     command::

	# ls -l /proc/[0-9]*/fd/* | grep console
	l-wx------ 1 root root 64 Mar 18 00:46 /proc/579/fd/0 -> /dev/console

     Then::

	# ps aux|grep 579
	root 579 0.0 0.1 1088 436 ? S 00:43 0:00 gpm -t ps/2

     So ``gpm`` will be killed by SAK. This is a bug in gpm. It should
     be closing standard input. You can work around this by finding the
     initscript which launches gpm and changing it thusly:

     Old::

	daemon gpm

     New::

	daemon gpm < /dev/null

     Vixie cron also seems to have this problem, and needs the same treatment.

     Also, one prominent Linux distribution has the following three
     lines in its rc.sysinit and rc scripts::

	exec 3<&0
	exec 4>&1
	exec 5>&2

     These commands cause **all** daemons which are launched by the
     initscripts to have file descriptors 3, 4 and 5 attached to
     /dev/console. So SAK kills them all. A workaround is to simply
     delete these lines, but this may cause system management
     applications to malfunction - test everything well.
.. include:: <isonum.txt>

============
SM501 Driver
============

:Copyright: |copy| 2006, 2007 Simtec Electronics

The Silicon Motion SM501 multimedia companion chip is a multifunction device
which may provide numerous interfaces including USB host controller, USB gadget,
......
============================
A block layer cache (bcache)
============================
Say you've got a big slow raid 6, and an ssd or three. Wouldn't it be
nice if you could use them as cache... Hence bcache.

Wiki and git repositories are at:

  - http://bcache.evilpiepirate.org
  - http://evilpiepirate.org/git/linux-bcache.git
  - http://evilpiepirate.org/git/bcache-tools.git
It's designed around the performance characteristics of SSDs - it only allocates
in erase block sized buckets, and it uses a hybrid btree/log to track cached

...@@ -37,17 +42,19 @@ to be flushed.

Getting started:

You'll need make-bcache from the bcache-tools repository. Both the cache device
and backing device must be formatted before use::

  make-bcache -B /dev/sdb
  make-bcache -C /dev/sdc

make-bcache has the ability to format multiple devices at the same time - if
you format your backing devices and cache device at the same time, you won't
have to manually attach::

  make-bcache -B /dev/sda /dev/sdb -C /dev/sdc

bcache-tools now ships udev rules, and bcache devices are known to the kernel
immediately. Without udev, you can manually register devices like this::

  echo /dev/sdb > /sys/fs/bcache/register
  echo /dev/sdc > /sys/fs/bcache/register

...@@ -60,16 +67,16 @@ slow devices as bcache backing devices without a cache, and you can choose to ad

a caching device later.
See 'ATTACHING' section below.

The devices show up as::

  /dev/bcache<N>

As well as (with udev)::

  /dev/bcache/by-uuid/<uuid>
  /dev/bcache/by-label/<label>

To get started::

  mkfs.ext4 /dev/bcache0
  mount /dev/bcache0 /mnt

...@@ -81,13 +88,13 @@ Cache devices are managed as sets; multiple caches per set isn't supported yet

but will allow for mirroring of metadata and dirty data in the future. Your new
cache set shows up as /sys/fs/bcache/<UUID>
Attaching
---------

After your cache device and backing device are registered, the backing device
must be attached to your cache set to enable caching. Attaching a backing
device to a cache set is done thusly, with the UUID of the cache set in
/sys/fs/bcache::

  echo <CSET-UUID> > /sys/block/bcache0/bcache/attach

...@@ -97,7 +104,7 @@ your bcache devices. If a backing device has data in a cache somewhere, the

important if you have writeback caching turned on.

If you're booting up and your cache device is gone and never coming back, you
can force run the backing device::

  echo 1 > /sys/block/sdb/bcache/running

...@@ -110,7 +117,7 @@ but all the cached data will be invalidated. If there was dirty data in the

cache, don't expect the filesystem to be recoverable - you will have massive
filesystem corruption, though ext4's fsck does work miracles.

Error Handling
--------------

Bcache tries to transparently handle IO errors to/from the cache device without

...@@ -134,25 +141,27 @@ the backing devices to passthrough mode.

read some of the dirty data, though.

Howto/cookbook
--------------

A) Starting a bcache with a missing caching device

If registering the backing device doesn't help, it's already there, you just need
to force it to run without the cache::

	host:~# echo /dev/sdb1 > /sys/fs/bcache/register
	[ 119.844831] bcache: register_bcache() error opening /dev/sdb1: device already registered

Next, you try to register your caching device if it's present. However
if it's absent, or registration fails for some reason, you can still
start your bcache without its cache, like so::

	host:/sys/block/sdb/sdb1/bcache# echo 1 > running

Note that this may cause data loss if you were running in writeback mode.

B) Bcache does not find its cache::

	host:/sys/block/md5/bcache# echo 0226553a-37cf-41d5-b3ce-8b1e944543a8 > attach
	[ 1933.455082] bcache: bch_cached_dev_attach() Couldn't find uuid for md5 in set

...@@ -160,7 +169,8 @@ B) Bcache does not find its cache

	[ 1933.478179] : cache set not found

In this case, the caching device was simply not registered at boot
or disappeared and came back, and needs to be (re-)registered::

	host:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register

...@@ -180,7 +190,8 @@ device is still available at an 8KiB offset. So either via a loopdev

of the backing device created with --offset 8K, or any value defined by
--data-offset when you originally formatted bcache with `make-bcache`.

For example::

	losetup -o 8192 /dev/loop0 /dev/your_bcache_backing_dev

This should present your unmodified backing device data in /dev/loop0

...@@ -191,33 +202,38 @@ cache device without loosing data.

E) Wiping a cache device
::

	host:~# wipefs -a /dev/sdh2
	16 bytes were erased at offset 0x1018 (bcache)
	they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81

After you boot back with bcache enabled, you recreate the cache and attach it::

	host:~# make-bcache -C /dev/sdh2
	UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045
	Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1
	version: 0
	nbuckets: 106874
	block_size: 1
	bucket_size: 1024
	nr_in_set: 1
	nr_this_dev: 0
	first_bucket: 1
	[ 650.511912] bcache: run_cache_set() invalidating existing data
	[ 650.549228] bcache: register_cache() registered cache device sdh2

start backing device with missing cache::

	host:/sys/block/md5/bcache# echo 1 > running

attach new cache::

	host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach
	[ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1

F) Remove or replace a caching device::
	host:/sys/block/sda/sda7/bcache# echo 1 > detach
	[ 695.872542] bcache: cached_dev_detach_finish() Caching disabled for sda7

...@@ -226,13 +242,15 @@ F) Remove or replace a caching device

	wipefs: error: /dev/nvme0n1p4: probing initialization failed: Device or resource busy

Ooops, it's disabled, but not unregistered, so it's still protected

We need to go and unregister it::

	host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# ls -l cache0
	lrwxrwxrwx 1 root root 0 Feb 25 18:33 cache0 -> ../../../devices/pci0000:00/0000:00:1d.0/0000:70:00.0/nvme/nvme0/nvme0n1/nvme0n1p4/bcache/
	host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# echo 1 > stop
	kernel: [ 917.041908] bcache: cache_set_free() Cache set b7ba27a1-2398-4649-8ae3-0959f57ba128 unregistered

Now we can wipe it::

	host:~# wipefs -a /dev/nvme0n1p4
	/dev/nvme0n1p4: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81

...@@ -252,40 +270,44 @@ if there are any active backing or caching devices left on it:

1) Is it present in /dev/bcache* ? (there are times where it won't be)

   If so, it's easy::

	host:/sys/block/bcache0/bcache# echo 1 > stop

2) But if your backing device is gone, this won't work::

	host:/sys/block/bcache0# cd bcache
	bash: cd: bcache: No such file or directory

   In this case, you may have to unregister the dmcrypt block device that
   references this bcache to free it up::

	host:~# dmsetup remove oldds1
	bcache: bcache_device_free() bcache0 stopped
	bcache: cache_set_free() Cache set 5bc072a8-ab17-446d-9744-e247949913c1 unregistered

   This causes the backing bcache to be removed from /sys/fs/bcache and
   then it can be reused. This would be true of any block device stacking
   where bcache is a lower device.
3) In other cases, you can also look in /sys/fs/bcache/::

	host:/sys/fs/bcache# ls -l */{cache?,bdev?}
	lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/
	lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/
	lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/

   The device names will show which UUID is relevant, cd in that directory
   and stop the cache::

	host:/sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1# echo 1 > stop

   This will free up bcache references and let you reuse the partition for
   other purposes.
Troubleshooting performance
---------------------------

Bcache has a bunch of config options and tunables. The defaults are intended to

...@@ -301,11 +323,13 @@ want for getting the best possible numbers when benchmarking.

  raid stripe size to get the disk multiples that you would like.

  For example: If you have a 64k stripe size, then the following offset
  would provide alignment for many common RAID5 data spindle counts::

	64k * 2*2*2*3*3*5*7 bytes = 161280k

  That space is wasted, but for only 157.5MB you can grow your RAID 5
  volume to the following data-spindle counts without re-aligning::

	3,4,5,6,7,8,9,10,12,14,15,18,20,21 ...
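The offset arithmetic above is easy to check with shell arithmetic. This is only an illustrative sketch of the numbers in the example (161280k is 64k times 2520, whose divisors give the spindle counts); it does not query any bcache tool:

```shell
# Verify the RAID5 alignment example: 64k * 2*2*2*3*3*5*7 = 161280k (157.5MB).
stripe_k=64
offset_k=$((stripe_k * 2*2*2*3*3*5*7))
echo "offset = ${offset_k}k"

# Spindle counts whose full stripe (n * 64k) evenly divides the offset:
for n in $(seq 2 21); do
  if [ $((offset_k % (n * stripe_k))) -eq 0 ]; then
    printf '%s ' "$n"
  fi
done
echo
```

Any count printed by the loop can be used without re-aligning the volume; counts such as 11, 13 or 16 do not divide 2520 and are absent from the list in the text.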
- Bad write performance

...@@ -313,9 +337,9 @@ want for getting the best possible numbers when benchmarking.

  If write performance is not what you expected, you probably wanted to be
  running in writeback mode, which isn't the default (not due to a lack of
  maturity, but simply because in writeback mode you'll lose data if something
  happens to your SSD)::

	# echo writeback > /sys/block/bcache0/bcache/cache_mode

- Bad performance, or traffic not going to the SSD that you'd expect

...@@ -325,13 +349,13 @@ want for getting the best possible numbers when benchmarking.

  accessed data out of your cache.

  But if you want to benchmark reads from cache, and you start out with fio
  writing an 8 gigabyte test file - so you want to disable that::

	# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

  To set it back to the default (4 mb), do::

	# echo 4M > /sys/block/bcache0/bcache/sequential_cutoff

- Traffic's still going to the spindle/still getting cache misses

...@@ -344,10 +368,10 @@ want for getting the best possible numbers when benchmarking.

  throttles traffic if the latency exceeds a threshold (it does this by
  cranking down the sequential bypass).

  You can disable this if you need to by setting the thresholds to 0::

	# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
	# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us

  The default is 2000 us (2 milliseconds) for reads, and 20000 for writes.

...@@ -369,7 +393,7 @@ want for getting the best possible numbers when benchmarking.

  a fix for the issue there).
Sysfs - backing device
----------------------

Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and

...@@ -454,7 +478,8 @@ writeback_running

  still be added to the cache until it is mostly full; only meant for
  benchmarking. Defaults to on.

Sysfs - backing device stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are directories with these numbers for a running total, as well as
versions that decay over the past day, hour and 5 minutes; they're also

...@@ -463,14 +488,11 @@ aggregated in the cache set directory as well.

bypassed
  Amount of IO (both reads and writes) that has bypassed the cache

cache_hits, cache_misses, cache_hit_ratio
  Hits and misses are counted per individual IO as bcache sees them; a
  partial hit is counted as a miss.

cache_bypass_hits, cache_bypass_misses
  Hits and misses for IO that is intended to skip the cache are still counted,
  but broken out here.

...@@ -482,7 +504,8 @@ cache_miss_collisions

cache_readaheads
  Count of times readahead occurred.
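The hit ratio is simply hits as a percentage of hits plus misses. A small shell sketch of reading the counters and computing the ratio by hand; the ``stats_total`` path assumes your device is ``bcache0``, and the fallback numbers (900/100) are made-up samples for a machine without bcache:

```shell
# Illustrative only: recompute cache_hit_ratio from the raw counters.
# The directory path and the fallback sample values are assumptions.
dir=/sys/block/bcache0/bcache/stats_total
hits=$(cat "$dir/cache_hits" 2>/dev/null || echo 900)
misses=$(cat "$dir/cache_misses" 2>/dev/null || echo 100)
echo "hit ratio: $((100 * hits / (hits + misses)))%"
```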
Sysfs - cache set
~~~~~~~~~~~~~~~~~

Available at /sys/fs/bcache/<cset-uuid>

...@@ -520,8 +543,7 @@ flash_vol_create

  Echoing a size to this file (in human readable units, k/M/G) creates a thinly
  provisioned volume backed by the cache set.

io_error_halflife, io_error_limit
  These determine how many errors we accept before disabling the cache.
  Each error is decayed by the half life (in # ios). If the decaying count
  reaches io_error_limit, dirty data is written out and the cache is disabled.
...@@ -545,7 +567,8 @@ unregister

Detaches all backing devices and closes the cache devices; if dirty data is
present it will disable writeback caching and wait for it to be flushed.

Sysfs - cache set internal
~~~~~~~~~~~~~~~~~~~~~~~~~~

This directory also exposes timings for a number of internal operations, with
separate files for average duration, average frequency, last occurrence and max

...@@ -574,7 +597,8 @@ cache_read_races

trigger_gc
  Writing to this file forces garbage collection to run.

Sysfs - Cache device
~~~~~~~~~~~~~~~~~~~~

Available at /sys/block/<cdev>/bcache
......
===================================================================
A driver for a selfmade cheap BT8xx based PCI GPIO-card (bt8xxgpio)
===================================================================

For advanced documentation, see http://www.bu3sch.de/btgpio.php

A generic digital 24-port PCI GPIO card can be built out of an ordinary
Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The

...@@ -17,9 +13,8 @@ The bt8xx chip does have 24 digital GPIO ports.

These ports are accessible via 24 pins on the SMD chip package.

How to physically access the GPIO pins
======================================

There are several ways to access these pins. One might unsolder the whole chip
and put it on a custom PCI board, or one might only unsolder each individual

...@@ -27,7 +22,7 @@ GPIO pin and solder that to some tiny wire. As the chip package really is tiny

there are some advanced soldering skills needed in any case.

The physical pinouts are drawn in the following ASCII art.
The GPIO pins are marked with G00-G23::

    G G G G G G G G G G G G G G G G G G
    0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
......
=============
btmrvl driver
=============

All commands are used via debugfs interface.

Set/get driver configurations
=============================

Path: /debug/btmrvl/config/

gpiogap=[n], hscfgcmd
  These commands are used to configure the host sleep parameters::

	bit 8:0  -- Gap
	bit 16:8 -- GPIO

...@@ -23,7 +21,8 @@ hscfgcmd

  where Gap is the gap in milliseconds between wakeup signal and
  wakeup event, or 0xff for special host sleep setting.

  Usage::

	# Use SDIO interface to wake up the host and set GAP to 0x80:
	echo 0xff80 > /debug/btmrvl/config/gpiogap
	echo 1 > /debug/btmrvl/config/hscfgcmd

...@@ -32,15 +31,16 @@ hscfgcmd

	echo 0x03ff > /debug/btmrvl/config/gpiogap
	echo 1 > /debug/btmrvl/config/hscfgcmd
psmode=[n], pscmd
  These commands are used to enable/disable auto sleep mode
  where the option is::

	1 -- Enable auto sleep mode
	0 -- Disable auto sleep mode

  Usage::

	# Enable auto sleep mode
	echo 1 > /debug/btmrvl/config/psmode
	echo 1 > /debug/btmrvl/config/pscmd

...@@ -50,15 +50,16 @@ pscmd

	echo 1 > /debug/btmrvl/config/pscmd

hsmode=[n], hscmd
  These commands are used to enable host sleep or wake up firmware
  where the option is::

	1 -- Enable host sleep
	0 -- Wake up firmware

  Usage::

	# Enable host sleep
	echo 1 > /debug/btmrvl/config/hsmode
	echo 1 > /debug/btmrvl/config/hscmd

...@@ -68,12 +69,13 @@ hscmd

	echo 1 > /debug/btmrvl/config/hscmd

Get driver status
=================

Path: /debug/btmrvl/status/

Usage::

	cat /debug/btmrvl/status/<args>

where the args are:

...@@ -90,14 +92,17 @@ hsstate

txdnldrdy
  This command displays the value of Tx download ready flag.

Issuing a raw hci command
=========================

Use hcitool to issue a raw hci command; refer to the hcitool manual.

Usage::

	hcitool cmd <ogf> <ocf> [Parameters]

Interface Control Command::

	hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00    --Enable All interface
	hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01    --Enable Wlan interface
	hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02    --Enable BT interface

...@@ -105,13 +110,13 @@ Use hcitool to issue raw hci command, refer to hcitool manual

	hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01    --Disable Wlan interface
	hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02    --Disable BT interface

SD8688 firmware
===============

Images:

- /lib/firmware/sd8688_helper.bin
- /lib/firmware/sd8688.bin

The images can be downloaded from:
......
==========================================================
How to access I/O mapped memory from within device drivers
==========================================================

:Author: Linus

.. warning::

	The virt_to_bus() and bus_to_virt() functions have been
	superseded by the functionality provided by the PCI DMA interface
	(see Documentation/DMA-API-HOWTO.txt). They continue
	to be documented below for historical purposes, but new code
	must not use them. --davidm 00/12/12

::

  [ This is a mail message in response to a query on IO mapping, thus the
    strange format for a "document" ]

The AHA-1542 is a bus-master device, and your patch makes the driver give the
controller the physical address of the buffers, which is correct on x86
(because all bus master devices see the physical memory mappings directly).

However, on many setups, there are actually **three** different ways of looking
at memory addresses, and in this case we actually want the third, the
so-called "bus address".
...@@ -38,7 +48,7 @@ because the memory and the devices share the same address space, and that is

not generally necessarily true on other PCI/ISA setups.

Now, just as an example, on the PReP (PowerPC Reference Platform), the
CPU sees a memory map something like this (this is from memory)::

	0-2 GB      "real memory"
	2 GB-3 GB   "system IO" (inb/out and similar accesses on x86)

...@@ -52,7 +62,7 @@ So when the CPU wants any bus master to write to physical memory 0, it

has to give the master address 0x80000000 as the memory address.

So, for example, depending on how the kernel is actually mapped on the
PPC, you can end up with a setup like this::

	physical address:   0
	virtual address:    0xC0000000

...@@ -61,7 +71,7 @@ PPC, you can end up with a setup like this:

where all the addresses actually point to the same thing. It's just seen
through different translations.

Similarly, on the Alpha, the normal translation is::

	physical address:   0
	virtual address:    0xfffffc0000000000

...@@ -70,7 +80,7 @@ Similarly, on the Alpha, the normal translation is

(but there are also Alphas where the physical address and the bus address
are the same).
Anyway, the way to look up all these translations, you do::

	#include <asm/io.h>

...@@ -81,8 +91,8 @@ Anyway, the way to look up all these translations, you do

Now, when do you need these?

You want the **virtual** address when you are actually going to access that
pointer from the kernel. So you can have something like this::

	/*
	 * this is the hardware "mailbox" we use to communicate with

...@@ -104,7 +114,7 @@ pointer from the kernel.  So you can have something like this:

	...

on the other hand, you want the bus address when you have a buffer that
you want to give to the controller::

	/* ask the controller to read the sense status into "sense_buffer" */
	mbox.bufstart = virt_to_bus(&sense_buffer);

...@@ -112,7 +122,7 @@ you want to give to the controller:

	mbox.status = 0;
	notify_controller(&mbox);

And you generally **never** want to use the physical address, because you can't
use that from the CPU (the CPU only uses translated virtual addresses), and
you can't use it from the bus master.

...@@ -124,8 +134,10 @@ be remapped as measured in units of pages, a.k.a. the pfn (the memory

management layer doesn't know about devices outside the CPU, so it
shouldn't need to know about "bus addresses" etc).

.. note::

	The above is only one part of the whole equation. The above
	only talks about "real memory", that is, CPU memory (RAM).

There is a completely different type of memory too, and that's the "shared
memory" on the PCI or ISA bus. That's generally not RAM (although in the case

...@@ -137,20 +149,22 @@ whatever, and there is only one way to access it: the readb/writeb and

related functions. You should never take the address of such memory, because
there is really nothing you can do with such an address: it's not
conceptually in the same memory space as "real memory" at all, so you cannot
just dereference a pointer. (Sadly, on x86 it **is** in the same memory space,
so on x86 it actually works to just dereference a pointer, but it's not
portable).
For such memory, you can do things like:

 - reading::

	/*
	 * read first 32 bits from ISA memory at 0xC0000, aka
	 * C000:0000 in DOS terms
	 */
	unsigned int signature = isa_readl(0xC0000);

 - remapping and writing::

	/*
	 * remap framebuffer PCI memory area at 0xFC000000,
	 * size 1MB, so that we can access it: We can directly

...@@ -165,7 +179,8 @@ For such memory, you can do things like

	/* unmap when we unload the driver */
	iounmap(baseptr);

 - copying and clearing::

	/* get the 6-byte Ethernet address at ISA address E000:0040 */
	memcpy_fromio(kernel_buffer, 0xE0040, 6);
	/* write a packet to the driver */

...@@ -181,10 +196,10 @@ happy that your driver works ;)

Note that kernel versions 2.0.x (and earlier) mistakenly called the
ioremap() function "vremap()". ioremap() is the proper name, but I
didn't think straight when I wrote it originally. People who have to
support both can do something like::

	/* support old naming silliness */
	#if LINUX_VERSION_CODE < 0x020100
	#define ioremap vremap
	#define iounmap vfree
	#endif

...@@ -196,13 +211,10 @@ And the above sounds worse than it really is. Most real drivers really

don't do all that complex things (or rather: the complexity is not so
much in the actual IO accesses as in error handling and timeouts etc).
It's generally not hard to fix drivers, and in many cases the code
actually looks better afterwards::

	unsigned long signature = *(unsigned int *) 0xC0000;
	  vs
	unsigned long signature = readl(0xC0000);

I think the second version actually is more readable, no?
==================================
Cache and TLB Flushing Under Linux
==================================

:Author: David S. Miller <davem@redhat.com>

This document describes the cache/tlb flushing interfaces called
by the Linux VM subsystem. It enumerates over each interface,

...@@ -28,7 +29,7 @@ Therefore when software page table changes occur, the kernel will

invoke one of the following flush methods _after_ the page table
changes occur:
1) ``void flush_tlb_all(void)``

	The most severe flush of all. After this interface runs,
	any previous page table modification whatsoever will be

...@@ -37,7 +38,7 @@ changes occur:

	This is usually invoked when the kernel page tables are
	changed, since such translations are "global" in nature.

2) ``void flush_tlb_mm(struct mm_struct *mm)``

	This interface flushes an entire user address space from
	the TLB. After running, this interface must make sure that

...@@ -49,8 +50,8 @@ changes occur:

	page table operations such as what happens during
	fork, and exec.

3) ``void flush_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)``

	Here we are flushing a specific range of (user) virtual
	address translations from the TLB. After running, this

...@@ -69,7 +70,7 @@ changes occur:

	call flush_tlb_page (see below) for each entry which may be
	modified.

4) ``void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)``

	This time we need to remove the PAGE_SIZE sized translation
	from the TLB. The 'vma' is the backing structure used by

...@@ -87,8 +88,8 @@ changes occur:

	This is used primarily during fault processing.

5) ``void update_mmu_cache(struct vm_area_struct *vma,
   unsigned long address, pte_t *ptep)``

	At the end of every page fault, this routine is invoked to
	tell the architecture specific code that a translation

...@@ -100,7 +101,7 @@ changes occur:

	translations for software managed TLB configurations.
	The sparc64 port currently does this.

6) ``void tlb_migrate_finish(struct mm_struct *mm)``

	This interface is called at the end of an explicit
	process migration. This interface provides a hook

...@@ -112,7 +113,7 @@ changes occur:

Next, we have the cache flushing interfaces. In general, when Linux
is changing an existing virtual-->physical mapping to a new value,
the sequence will be in one of the following forms::

	1) flush_cache_mm(mm);
	   change_all_page_tables_of(mm);

...@@ -143,7 +144,7 @@ and have no dependency on translation information.

Here are the routines, one by one:
1) ``void flush_cache_mm(struct mm_struct *mm)``

	This interface flushes an entire user address space from
	the caches. That is, after running, there will be no cache

...@@ -152,7 +153,7 @@ Here are the routines, one by one:

	This interface is used to handle whole address space
	page table operations such as what happens during exit and exec.

2) ``void flush_cache_dup_mm(struct mm_struct *mm)``

	This interface flushes an entire user address space from
	the caches. That is, after running, there will be no cache

...@@ -164,8 +165,8 @@ Here are the routines, one by one:

	This option is separate from flush_cache_mm to allow some
	optimizations for VIPT caches.

3) ``void flush_cache_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)``

	Here we are flushing a specific range of (user) virtual
	addresses from the cache. After running, there will be no

...@@ -181,7 +182,7 @@ Here are the routines, one by one:

	call flush_cache_page (see below) for each entry which may be
	modified.

4) ``void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)``

	This time we need to remove a PAGE_SIZE sized range
	from the cache. The 'vma' is the backing structure used by

...@@ -202,7 +203,7 @@ Here are the routines, one by one:

	This is used primarily during fault processing.

5) ``void flush_cache_kmaps(void)``

	This routine need only be implemented if the platform utilizes
	highmem. It will be called right before all of the kmaps

...@@ -214,8 +215,8 @@ Here are the routines, one by one:

	This routine should be implemented in asm/highmem.h

6) ``void flush_cache_vmap(unsigned long start, unsigned long end)``
   ``void flush_cache_vunmap(unsigned long start, unsigned long end)``

	Here in these two interfaces we are flushing a specific range
	of (kernel) virtual addresses from the cache. After running,

...@@ -243,8 +244,10 @@ size). This setting will force the SYSv IPC layer to only allow user

processes to mmap shared memory at addresses which are a multiple of
this value.
.. note::

	This does not fix shared mmaps, check out the sparc64 port for
	one way to solve this (in particular SPARC_FLAG_MMAPSHARED).

Next, you have to solve the D-cache aliasing issue for all
other cases. Please keep in mind the fact that, for a given page

...@@ -255,8 +258,8 @@ physical page into its address space, by implication the D-cache

aliasing problem has the potential to exist since the kernel already
maps this page at its virtual address.

``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)``
``void clear_user_page(void *to, unsigned long addr, struct page *page)``

	These two routines store data in user anonymous or COW
	pages. It allows a port to efficiently avoid D-cache alias

...@@ -276,14 +279,16 @@ maps this page at its virtual address.

	If D-cache aliasing is not an issue, these two routines may
	simply call memcpy/memset directly and do nothing more.

``void flush_dcache_page(struct page *page)``

	Any time the kernel writes to a page cache page, _OR_
	the kernel is about to read from a page cache page and
	user space shared/writable mappings of this page potentially
	exist, this routine is called.

	.. note::

		This routine need only be called for page cache pages
		which can potentially ever be mapped into the address
		space of a user process. So for example, VFS layer code
		handling vfs symlinks in the page cache need not call

...@@ -322,18 +327,19 @@ maps this page at its virtual address.

	made of this flag bit, and if set the flush is done and the flag
	bit is cleared.
IMPORTANT NOTE: It is often important, if you defer the flush, .. important::
It is often important, if you defer the flush,
that the actual flush occurs on the same CPU that the actual flush occurs on the same CPU
as did the cpu stores into the page to make it as did the cpu stores into the page to make it
dirty. Again, see sparc64 for examples of how dirty. Again, see sparc64 for examples of how
to deal with this. to deal with this.
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``

``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
        When the kernel needs to copy arbitrary data in and out
        of arbitrary user pages (f.e. for ptrace()) it will use
        these two routines.

@@ -344,8 +350,9 @@ maps this page at its virtual address.

        likely that you will need to flush the instruction cache
        for copy_to_user_page().
``void flush_anon_page(struct vm_area_struct *vma, struct page *page,
unsigned long vmaddr)``

        When the kernel needs to access the contents of an anonymous
        page, it calls this function (currently only
        get_user_pages()). Note: flush_dcache_page() deliberately
@@ -354,7 +361,8 @@ maps this page at its virtual address.

        architectures). For incoherent architectures, it should flush
        the cache of the page at vmaddr.

``void flush_kernel_dcache_page(struct page *page)``

        When the kernel needs to modify a user page it has obtained
        with kmap, it calls this function after all modifications are
        complete (but before kunmapping it) to bring the underlying
@@ -366,14 +374,16 @@ maps this page at its virtual address.

        the kernel cache for page (using page_address(page)).

``void flush_icache_range(unsigned long start, unsigned long end)``

        When the kernel stores into addresses that it will execute
        out of (eg when loading modules), this function is called.

        If the icache does not snoop stores then this routine will need
        to flush it.

``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``

        All the functionality of flush_icache_page can be implemented in
        flush_dcache_page and update_mmu_cache. In the future, the hope
        is to remove this interface completely.
@@ -387,7 +397,8 @@ the kernel trying to do I/O to vmap areas must manually manage

coherency. It must do this by flushing the vmap range before doing
I/O and invalidating it after the I/O returns.

``void flush_kernel_vmap_range(void *vaddr, int size)``

        flushes the kernel cache for a given virtual address range in
        the vmap area. This is to make sure that any data the kernel
        modified in the vmap range is made visible to the physical
@@ -395,7 +406,8 @@ I/O and invalidating it after the I/O returns.

        Note that this API does *not* also flush the offset map alias
        of the area.

``void invalidate_kernel_vmap_range(void *vaddr, int size)``

        invalidates the cache for a given virtual address range in
        the vmap area which prevents the processor from making the
        cache stale by speculatively reading data while the I/O was
        occurring to the
================
Circular Buffers
================

:Author: David Howells <dhowells@redhat.com>
:Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Linux provides a number of features that can be used to implement circular
@@ -20,7 +20,7 @@ producer and just one consumer. It is possible to handle multiple producers by

serialising them, and to handle multiple consumers by serialising them.

.. Contents:

 (*) What is a circular buffer?

@@ -31,8 +31,8 @@ Contents:

     - The consumer.
What is a circular buffer?
==========================
First of all, what is a circular buffer? A circular buffer is a buffer of

@@ -60,9 +60,7 @@ buffer, provided that neither index overtakes the other. The implementer must

be careful, however, as a region more than one unit in size may wrap the end of
the buffer and be broken into two segments.
Measuring power-of-2 buffers
============================
Calculation of the occupancy or the remaining capacity of an arbitrarily sized

@@ -71,13 +69,13 @@ modulus (divide) instruction. However, if the buffer is of a power-of-2 size,

then a much quicker bitwise-AND instruction can be used instead.

Linux provides a set of macros for handling power-of-2 circular buffers. These
can be made use of by::

        #include <linux/circ_buf.h>

The macros are:

 (#) Measure the remaining capacity of a buffer::

        CIRC_SPACE(head_index, tail_index, buffer_size);
@@ -85,7 +83,7 @@ The macros are:

     can be inserted.

 (#) Measure the maximum consecutive immediate space in a buffer::

        CIRC_SPACE_TO_END(head_index, tail_index, buffer_size);
@@ -94,14 +92,14 @@ The macros are:

     beginning of the buffer.

 (#) Measure the occupancy of a buffer::

        CIRC_CNT(head_index, tail_index, buffer_size);

     This returns the number of items currently occupying a buffer[2].

 (#) Measure the non-wrapping occupancy of a buffer::

        CIRC_CNT_TO_END(head_index, tail_index, buffer_size);
@@ -112,7 +110,7 @@ The macros are:

Each of these macros will nominally return a value between 0 and buffer_size-1,
however:

 (1) CIRC_SPACE*() are intended to be used in the producer. To the producer
     they will return a lower bound as the producer controls the head index,
     but the consumer may still be depleting the buffer on another CPU and
     moving the tail index.
@@ -120,7 +118,7 @@ however:

     To the consumer it will show an upper bound as the producer may be busy
     depleting the space.

 (2) CIRC_CNT*() are intended to be used in the consumer. To the consumer they
     will return a lower bound as the consumer controls the tail index, but the
     producer may still be filling the buffer on another CPU and moving the
     head index.
@@ -128,14 +126,12 @@ however:

     To the producer it will show an upper bound as the consumer may be busy
     emptying the buffer.

 (3) To a third party, the order in which the writes to the indices by the
     producer and consumer become visible cannot be guaranteed as they are
     independent and may be made on different CPUs - so the result in such a
     situation will merely be a guess, and may even be negative.
Using memory barriers with circular buffers
===========================================
By using memory barriers in conjunction with circular buffers, you can avoid

@@ -152,10 +148,10 @@ time, and only one thing should be emptying a buffer at any one time, but the

two sides can operate simultaneously.

The producer
------------

The producer will look something like this::

        spin_lock(&producer_lock);
@@ -193,10 +189,10 @@ ordering between the read of the index indicating that the consumer has

vacated a given element and the write by the producer to that same element.

The consumer
------------

The consumer will look something like this::

        spin_lock(&consumer_lock);
@@ -235,8 +231,7 @@ prevents the compiler from tearing the store, and enforces ordering

against previous accesses.
Further reading
===============

See also Documentation/memory-barriers.txt for a description of Linux's memory
========================
The Common Clk Framework
========================

:Author: Mike Turquette <mturquette@ti.com>

This document endeavours to explain the common clk framework details,
and how to port a platform over to this framework. It is not yet a
detailed explanation of the clock api in include/linux/clk.h, but
perhaps someday it will include that information.

Introduction and interface split
================================

The common clk framework is an interface to control the clock nodes
available on various devices today. This may come in the form of clock
@@ -35,10 +39,11 @@ is defined in struct clk_foo and pointed to within struct clk_core. This

allows for easy navigation between the two discrete halves of the common
clock interface.

Common data structures and api
==============================

Below is the common struct clk_core definition from
drivers/clk/clk.c, modified for brevity::

        struct clk_core {
                const char              *name;
@@ -59,7 +64,7 @@ struct clk. That api is documented in include/linux/clk.h.

Platforms and devices utilizing the common struct clk_core use the struct
clk_ops pointer in struct clk_core to perform the hardware-specific parts of
the operations defined in clk-provider.h::

        struct clk_ops {
                int             (*prepare)(struct clk_hw *hw);

@@ -95,19 +100,20 @@ the operations defined in clk-provider.h:

                                      struct dentry *dentry);
        };
Hardware clk implementations
============================

The strength of the common struct clk_core comes from its .ops and .hw pointers
which abstract the details of struct clk from the hardware-specific bits, and
vice versa. To illustrate consider the simple gateable clk implementation in
drivers/clk/clk-gate.c::

        struct clk_gate {
                struct clk_hw   hw;
                void __iomem    *reg;
                u8              bit_idx;
                ...
        };

struct clk_gate contains struct clk_hw hw as well as hardware-specific
knowledge about which register and bit controls this clk's gating.
@@ -115,7 +121,7 @@ Nothing about clock topology or accounting, such as enable_count or

notifier_count, is needed here. That is all handled by the common
framework code and struct clk_core.

Let's walk through enabling this clk from driver code::

        struct clk *clk;
        clk = clk_get(NULL, "my_gateable_clk");

@@ -123,70 +129,71 @@ Let's walk through enabling this clk from driver code:

        clk_prepare(clk);
        clk_enable(clk);

The call graph for clk_enable is very simple::

        clk_enable(clk);
                clk->ops->enable(clk->hw);
                [resolves to...]
                        clk_gate_enable(hw);
                        [resolves struct clk gate with to_clk_gate(hw)]
                                clk_gate_set_bit(gate);
And the definition of clk_gate_set_bit::

        static void clk_gate_set_bit(struct clk_gate *gate)
        {
                u32 reg;

                reg = __raw_readl(gate->reg);
                reg |= BIT(gate->bit_idx);
                writel(reg, gate->reg);
        }

Note that to_clk_gate is defined as::

        #define to_clk_gate(_hw) container_of(_hw, struct clk_gate, hw)

This pattern of abstraction is used for every clock hardware
representation.
Supporting your own clk hardware
================================

When implementing support for a new type of clock it is only necessary to
include the following header::

        #include <linux/clk-provider.h>

To construct a clk hardware structure for your platform you must define
the following::

        struct clk_foo {
                struct clk_hw hw;
                ... hardware specific data goes here ...
        };

To take advantage of your data you'll need to support valid operations
for your clk::

        struct clk_ops clk_foo_ops {
                .enable         = &clk_foo_enable;
                .disable        = &clk_foo_disable;
        };

Implement the above functions using container_of::

        #define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw)

        int clk_foo_enable(struct clk_hw *hw)
        {
                struct clk_foo *foo;

                foo = to_clk_foo(hw);

                ... perform magic on foo ...

                return 0;
        };
Below is a matrix detailing which clk_ops are mandatory based upon the
hardware capabilities of that clock. A cell marked as "y" means

@@ -194,41 +201,56 @@ mandatory, a cell marked as "n" implies that either including that

callback is invalid or otherwise unnecessary. Empty cells are either
optional or must be evaluated on a case-by-case basis.
.. table:: clock hardware characteristics

   +----------------+------+-------------+---------------+-------------+------+
   |                | gate | change rate | single parent | multiplexer | root |
   +================+======+=============+===============+=============+======+
   |.prepare        |      |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.unprepare      |      |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.enable         | y    |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.disable        | y    |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.is_enabled     | y    |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.recalc_rate    |      | y           |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.round_rate     |      | y [1]_      |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.determine_rate |      | y [1]_      |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.set_rate       |      | y           |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.set_parent     |      |             | n             | y           | n    |
   +----------------+------+-------------+---------------+-------------+------+
   |.get_parent     |      |             | n             | y           | n    |
   +----------------+------+-------------+---------------+-------------+------+
   |.recalc_accuracy|      |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.init           |      |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+

.. [1] either one of round_rate or determine_rate is required.
Finally, register your clock at run-time with a hardware-specific
registration function. This function simply populates struct clk_foo's
data and then passes the common struct clk parameters to the framework
with a call to::

        clk_register(...)

See the basic clock types in ``drivers/clk/clk-*.c`` for examples.
Disabling clock gating of unused clocks
=======================================

Sometimes during development it can be useful to be able to bypass the
default disabling of unused clocks. For example, if drivers aren't enabling

@@ -239,7 +261,8 @@ are sorted out.

To bypass this disabling, include "clk_ignore_unused" in the bootargs to the
kernel.

Locking
=======

The common clock framework uses two global locks, the prepare lock and the
enable lock.
========
CPU load
========

Linux exports various bits of information via ``/proc/stat`` and
``/proc/uptime`` that userland tools, such as top(1), use to calculate
the average time the system spent in a particular state, for example::

        $ iostat
        Linux 2.6.18.3-exp (linmac)     02/20/2007
@@ -17,7 +18,7 @@ Here the system thinks that over the default sampling period the

system spent 10.01% of the time doing work in user space, 2.92% in the
kernel, and was overall 81.63% of the time idle.
In most cases the ``/proc/stat`` information reflects the reality quite
closely, however due to the nature of how/when the kernel collects
this data sometimes it cannot be trusted at all.
@@ -33,78 +34,78 @@ Example

-------

If we imagine the system with one task that periodically burns cycles
in the following manner::

         time line between two timer interrupts
        |--------------------------------------|
         ^                                    ^
         |_ something begins working          |
                                              |_ something goes to sleep
                                                 (only to be awaken quite soon)

In the above situation the system will be 0% loaded according to the
``/proc/stat`` (since the timer interrupt will always happen when the
system is executing the idle handler), but in reality the load is
closer to 99%.
One can imagine many more situations where this behavior of the kernel
will lead to quite erratic information inside ``/proc/stat``::

        /* gcc -o hog smallhog.c */
        #include <time.h>
        #include <limits.h>
        #include <signal.h>
        #include <sys/time.h>

        #define HIST 10

        static volatile sig_atomic_t stop;

        static void sighandler(int signr)
        {
                (void) signr;
                stop = 1;
        }

        static unsigned long hog(unsigned long niters)
        {
                stop = 0;
                while (!stop && --niters)
                        ;
                return niters;
        }

        int main(void)
        {
                int i;
                struct itimerval it = {
                        .it_interval = { .tv_sec = 0, .tv_usec = 1 },
                        .it_value    = { .tv_sec = 0, .tv_usec = 1 } };
                sigset_t set;
                unsigned long v[HIST];
                double tmp = 0.0;
                unsigned long n;

                signal(SIGALRM, &sighandler);
                setitimer(ITIMER_REAL, &it, NULL);

                hog(ULONG_MAX);
                for (i = 0; i < HIST; ++i)
                        v[i] = ULONG_MAX - hog(ULONG_MAX);
                for (i = 0; i < HIST; ++i)
                        tmp += v[i];
                tmp /= HIST;
                n = tmp - (tmp / 3.0);

                sigemptyset(&set);
                sigaddset(&set, SIGALRM);

                for (;;) {
                        hog(n);
                        sigwait(&set, &i);
                }
                return 0;
        }
References
----------

- http://lkml.org/lkml/2007/2/12/6
- Documentation/filesystems/proc.txt (1.8)

Thanks
===========================================
How CPU topology info is exported via sysfs
===========================================
Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures:

@@ -75,24 +78,26 @@ CONFIG_SCHED_BOOK and CONFIG_DRAWER are currently only used on s390, where

they reflect the cpu and cache hierarchy.

For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h::
#define topology_physical_package_id(cpu)
#define topology_core_id(cpu) #define topology_physical_package_id(cpu)
#define topology_book_id(cpu) #define topology_core_id(cpu)
#define topology_drawer_id(cpu) #define topology_book_id(cpu)
#define topology_sibling_cpumask(cpu) #define topology_drawer_id(cpu)
#define topology_core_cpumask(cpu) #define topology_sibling_cpumask(cpu)
#define topology_book_cpumask(cpu) #define topology_core_cpumask(cpu)
#define topology_drawer_cpumask(cpu) #define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
The type of ``**_id`` macros is int.
The type of ``**_cpumask`` macros is ``(const) struct cpumask *``. The latter
correspond with appropriate ``**_siblings`` sysfs attributes (except for
topology_sibling_cpumask() which corresponds with thread_siblings).
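The macros above are surfaced to userspace as per-CPU attribute files under sysfs. As an illustration only, here is a minimal sketch of reading every attribute exported under a CPU's ``topology/`` directory; ``read_topology()`` and the throwaway sample tree are hypothetical helpers, not kernel interfaces — only the sysfs layout itself comes from this document.

```python
import os
import tempfile

def read_topology(cpu, base="/sys/devices/system/cpu"):
    """Read every attribute under cpuN/topology/ into a dict of strings."""
    tdir = os.path.join(base, "cpu%d" % cpu, "topology")
    attrs = {}
    for name in sorted(os.listdir(tdir)):
        with open(os.path.join(tdir, name)) as f:
            attrs[name] = f.read().strip()
    return attrs

# Demonstrate against a throwaway tree that mimics the sysfs layout,
# so the sketch also runs on machines without these attributes.
base = tempfile.mkdtemp()
tdir = os.path.join(base, "cpu0", "topology")
os.makedirs(tdir)
for name, value in [("physical_package_id", "0"),
                    ("core_id", "0"),
                    ("thread_siblings_list", "0-1")]:
    with open(os.path.join(tdir, name), "w") as f:
        f.write(value + "\n")

print(read_topology(0, base))
# → {'core_id': '0', 'physical_package_id': '0', 'thread_siblings_list': '0-1'}
```

On a real system the ``base`` default points at the live sysfs tree, where the same call returns whatever attributes the architecture exports.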
To be consistent on all architectures, include/linux/topology.h
provides default definitions for any of the above macros that are
not defined by include/asm-XXX/topology.h:

1) physical_package_id: -1
2) core_id: 0
3) sibling_cpumask: just the given CPU
Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files.  The internal
source for the output is in brackets ("[]").
=========== ==========================================================
kernel_max: the maximum CPU index allowed by the kernel configuration.
            [NR_CPUS-1]

present:    CPUs that have been identified as being present in the
            system. [cpu_present_mask]
=========== ==========================================================
The format for the above output is compatible with cpulist_parse()
[see <linux/cpumask.h>].  Some examples follow.
In this example, there are 64 CPUs in the system but cpus 32-63 exceed
the kernel max which is limited to 0..31 by the NR_CPUS config option
being 32.  Note also that CPUs 2 and 4-31 are not online but could be
brought online as they are both present and possible::
     kernel_max: 31
        offline: 2,4-31,32-63
In this example, the NR_CPUS config option is 128, but the kernel was
started with possible_cpus=144.  There are 4 CPUs in the system and cpu2
was manually taken offline (and is the only CPU that can be brought
online.)::
     kernel_max: 127
        offline: 2,4-127,128-143
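The cpulist strings in the examples above ("2,4-31,32-63") are comma-separated lists of CPU numbers and inclusive ranges. As an illustration, a minimal userspace expansion of that format might look like the sketch below; ``parse_cpulist()`` is a hypothetical helper, and the kernel's cpulist_parse() remains the authoritative parser.

```python
def parse_cpulist(cpulist):
    """Expand a cpulist string such as "2,4-31,32-63" into a set of ints."""
    cpus = set()
    for chunk in cpulist.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            # An inclusive range like "4-31".
            lo, hi = chunk.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            # A single CPU number.
            cpus.add(int(chunk))
    return cpus

# The "offline" line from the first example: CPU 2 plus CPUs 4-63.
assert parse_cpulist("2,4-31,32-63") == {2} | set(range(4, 64))
```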
===================================
Dell Systems Management Base Driver
===================================
Overview
========
The Dell Systems Management Base Driver provides a sysfs interface for
systems management software such as Dell OpenManage to perform system

more information about the libsmbios project.
System Management Interrupt
===========================
On some Dell systems, systems management software must access certain
management information via a system management interrupt (SMI).  The SMI data
buffer must reside in 32-bit address space, and the physical address of the
buffer is required for the SMI.  The driver maintains the memory required for
the SMI and provides a way for the application to generate the SMI.
The driver creates the following sysfs entries for systems management
software to perform these system management interrupts::

    /sys/devices/platform/dcdbas/smi_data
    /sys/devices/platform/dcdbas/smi_data_buf_phys_addr
    /sys/devices/platform/dcdbas/smi_data_buf_size
    /sys/devices/platform/dcdbas/smi_request
Systems management software must perform the following steps to execute
a SMI using this driver:
Host Control Action
===================
Dell OpenManage supports a host control feature that allows the administrator
to perform a power cycle or power off of the system after the OS has finished

power off host control action using this driver:
Host Control SMI Type
=====================
The following table shows the value to write to host_control_smi_type to
perform a power cycle or power off host control action:

=================== =====================
PowerEdge System    Host Control SMI Type
=================== =====================
300                 HC_SMITYPE_TYPE1
1300                HC_SMITYPE_TYPE1
1400                HC_SMITYPE_TYPE2
1655MC              HC_SMITYPE_TYPE2
700                 HC_SMITYPE_TYPE3
750                 HC_SMITYPE_TYPE3
=================== =====================
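Userspace tooling that drives this interface needs the system-to-type mapping from the table above. As a sketch only, it could be encoded as a simple lookup; ``HOST_CONTROL_SMI_TYPE`` and ``smi_type_for()`` are hypothetical names, not part of the driver interface, and only the rows visible in this document are included.

```python
# Mapping taken directly from the Host Control SMI Type table above.
HOST_CONTROL_SMI_TYPE = {
    "300":    "HC_SMITYPE_TYPE1",
    "1300":   "HC_SMITYPE_TYPE1",
    "1400":   "HC_SMITYPE_TYPE2",
    "1655MC": "HC_SMITYPE_TYPE2",
    "700":    "HC_SMITYPE_TYPE3",
    "750":    "HC_SMITYPE_TYPE3",
}

def smi_type_for(system):
    """Return the host_control_smi_type value for a PowerEdge model."""
    return HOST_CONTROL_SMI_TYPE[system]

print(smi_type_for("1400"))
# → HC_SMITYPE_TYPE2
```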
==========================================================
ISA Plug & Play support by Jaroslav Kysela <perex@suse.cz>
==========================================================