************************************************************
*                                                          *
*  Tutorial To Linux Driver Writing -- Character Devices   *
*                                                          *
*                           or,                            *
*                                                          *
*       Now That I'm Wacky, Let Me Do Something (I)        *
*                                                          *
*               Last Revision: Apr 11, 1993                *
*                                                          *
************************************************************
|
|
|
|
This document (C) 1993 Robert Baruch. This document may be freely
copied as long as the entire title, copyright, this notice, and all of
the introduction are included along with it. Suggestions, criticisms,
and comments to baruch@nynexst.com. This document, nor the work
performed by Robert Baruch using Linux, nor the results of said work
are connected in any way to any of the Nynex companies. This product
0% organic as defined by California Statute 4Z//7&A. No artificial
coloring or flavoring.
|
|
|
|
========================
Introduction
========================

There is a companion guide to this Tutorial, the Guide to Linux Driver
Writing -- Character Devices. This Guide should serve as a reference to
both beginning and advanced driver writers, and should be used in
conjunction with this Tutorial.

-=-=-=-=-=-=-

Some words of thanks:

Many thanks to:

    Donald J. Becker (becker@metropolis.super.org)
    Don "May the Source be With You!" Holzworth (donh@gcx1.ssd.csd.harris.com)
    Michael Johnson (johnsonm@stolaf.edu)
    Karl Heinz Kremer (khk@raster.kodak.com)
    Pat Mackinlay (mackinla@cs.curtin.edu.au)
    ...others too numerous to mention...
    All the driver writers!

...and of course, Linus "That's LIN-uhks" Torvalds and all the guys who helped
develop Linux into a BLOODY KICKIN' O/S!

-=-=-=-=-=-=-

...and now a word of warning:

Messing about with drivers is messing with the kernel. Drivers are run
at the kernel level, and as such are not subject to scheduling. Further,
drivers have access to various kernel structures. Before you actually
write a driver, be *damned* sure of what you are doing, lest you end
up having to re-format your hard drive and re-install Linux!

The information in this Tutorial is as up-to-date as I could make it. It also
has no stamp of approval whatsoever by any of the designers of the kernel.
I am not responsible for damage caused to anything as a result of using this
Guide.

========================
End of Introduction
========================
|
|
|
|
|
|
CHAPTRE THE FIRSTE : How did *they* get the device driver in the kernel?
------------------

You have to realize that device drivers really are part of the kernel. The
kernel can hook in to the functions in your device driver if you tell it
the addresses of some standard functions. These standard functions are
detailed in the Guide.

As a part of the kernel, the code of the device driver must be compiled in
*with* the kernel. That is, you must alter some Makefiles to compile your
driver and to get it archived into the chr_drv.a library, or you can
archive it yourself and link it in to the kernel at a later compile stage.

The first step, before you even write a single line of driver code, is to
make sure you know how to recompile the kernel. Then go ahead and actually
do it, to be sure you (and your system) are sane. Of course, you need
the sources to the kernel. If you have the SLS distribution of Linux, you
already have the sources in /linux. If you don't have the sources, you
can get them at one of these fine ftp sites near you:

    tsx-11.mit.edu:/pub/linux
    sunsite.unc.edu:/pub/Linux

Briefly, here's how to compile the kernel (at least this is how it's done
in the SLS release):

Go to /linux (or wherever the source for Linux is).
You will see a directory which looks a lot like this:

    -rw-r--r--   1 baruch      17982 Nov 10 07:54 COPYING
    -rw-r--r--   1 baruch       1444 Jan 13 15:24 Configure
    -rw-r--r--   1 baruch       6934 Feb 22 13:31 Makefile
    -rw-r--r--   1 baruch       4078 Dec 12 06:45 README
    drwxrwxr-x   2 baruch        512 Feb 22 13:34 boot
    -rw-r--r--   1 baruch       1724 Feb  9 15:07 config.in
    drwxrwxr-x   8 baruch        512 Feb 22 13:34 fs
    drwxrwxr-x   4 baruch        512 Dec  1 19:40 include
    drwxrwxr-x   2 baruch        512 Feb 22 13:34 init
    drwxrwxr-x   5 baruch        512 Feb  9 15:11 kernel
    drwxrwxr-x   2 baruch        512 Feb  9 15:11 lib
    -rwxr-xr-x   1 baruch        166 Nov 10 07:54 makever.sh
    drwxrwxr-x   2 baruch        512 Feb 22 13:34 mm
    drwxrwxr-x   3 baruch        512 Feb  9 15:11 net
    drwxrwxr-x   2 baruch        512 Feb 22 13:34 tools
    drwxrwxr-x   2 baruch        512 Feb 22 13:34 zBoot

The README file should contain instructions, but here's how anyway:

Log in as root.

    make clean     (Do this only once. Otherwise you'll have to sit around
                    for 45 minutes or so while the whole thing recompiles)
    make config    (Answer the questions -- usually needed only the first time)
    make dep       (Makes dependencies)
    make           (makes the kernel)

You should end up with an Image file. This is the kernel. Put it where
you like (LILO users should take it from there). To make a bootable disk,
just pop a DOS formatted disk in drive A, and do:

    make disk
|
|
|
|
------------------------------------------------------------------------

CHAPTER TWO: The simplest driver you've ever seen.
------------

Now, the directory you're interested in is <src>/kernel/chr_drv. This is
where all the character device drivers are kept. Go to that directory.
Open up a new file, and call it testdata.c. Here is what you should
put in it:

========================================
File Listing 1: testdata.c
========================================

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/tty.h>
#include <linux/signal.h>
#include <linux/errno.h>

#include <asm/io.h>
#include <asm/segment.h>
#include <asm/system.h>
#include <asm/irq.h>

unsigned long test_init(unsigned long kmem_start)
{
   printk("Test Data Generator installed.\n");
   return kmem_start;
}

========================================
|
|
|
|
The include files are all there for convenience. You may need them later.
All this driver does is display a message upon initialization.

Now, to get this driver into the kernel, you need to do several things.
Do the first two in the chr_drv directory:

I.  Get the kernel to call your init function on bootup. To do this,
    edit the mem.c file, and go to the very end to the function
    chr_dev_init. It looks something like this:

    long chr_dev_init(long mem_start, long mem_end)
    {
       if (register_chrdev(1,"mem",&memory_fops))
          printk("unable to get major 1 for memory devs\n");
       mem_start = tty_init(mem_start);
       mem_start = lp_init(mem_start);
       mem_start = mouse_init(mem_start);
       mem_start = soundcard_init(mem_start);
       return mem_start;
    }

    You need to add your test_init function to the code. Put it right
    before the return:

       mem_start = test_init(mem_start);

    Save the file.

II. Edit the Makefile to compile testdata.c. Edit the Makefile, and add
    testdata.o to the OBJS list. This will cause the make utility to
    compile testdata.c into an object file, and then add it to the
    chr_drv.a library archive.

    Save the file.

The next step is to re-compile the kernel. Go to the <src> directory,
and do a make from the top as described in the first chapter. There is
no point in doing a "make clean" or "make config". If all goes well, the
make should proceed down to chr_drv, and compile your testdata.c file.
If there are warnings or errors, do a ctrl-C to break out of the make,
and fix the problem.

Once you are left with an Image file, put the Image file where LILO
wants it, or use "make disk" to make a bootable disk. It's a good
idea to save your old Image file (or save the disk it was on).

Now reboot. When Linux comes up again, you should see your message
printed on bootup after all the character devices' messages, before
any of the block device messages. If the message came up, have a soda.
Jump up and down a little. (Well, first jump, _then_ have the soda).

If it didn't work, go back and find out what you did wrong. Are you
sure you recompiled the kernel? Did it recompile with testdata.c? Did
you reboot using the new kernel? Are you sure? Are you root? Maybe
your kernel is bad or old. I have used 0.99pl6, with the new libc.so.4.3.2
shared library successfully, and I am currently using 0.99pl8 with
libc.so.4.3.3.
|
|
|
|
|
|
------------------------------------------------------------------------

CHAPTER THREE: A device driver that actually does something useful.
-------------

This example is taken from the _Writing UNIX Device Drivers_ book by
George Pajari, published by Addison Wesley. It can usually be found
in a Barnes and Noble bookstore, or any large bookstore which has
a nice section on UNIX. The ISBN is 0-201-52374-4, and it was published
in 1992. This book is highly recommended for the device driver writer.

This device driver will actually be read from. You can open and close
it (which really won't do much), but the biggest thing it will do is
allow you to read from it. This driver won't access any external hardware,
and so it is called a "pseudo device driver". That is, it really doesn't
drive any device.

Have your Guide handy? OK, now alter your testdata.c file so that it
looks like this:
|
|
|
|
========================================
File Listing 2: testdata.c
========================================

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/tty.h>
#include <linux/signal.h>
#include <linux/errno.h>

#include <asm/io.h>
#include <asm/segment.h>
#include <asm/system.h>
#include <asm/irq.h>

static char test_data[]="Linux is really funky!\n";

static int test_read(struct inode * inode, struct file * file,
                     char * buffer, int count)
{
   int offset;

   printk("Test Data Generator, reading %d bytes\n",count);
   if (count<=0) return -EINVAL;
   /* copy the message to user space, wrapping around at its end */
   for (offset=0; offset<count; offset++)
      put_fs_byte(test_data[offset % (sizeof(test_data)-1)], buffer+offset);
   return offset;
}

static int test_open(struct inode *inode, struct file *file)
{
   printk("Test Data Generator opened.\n");
   return 0;
}

static void test_release(struct inode *inode, struct file *file)
{
   printk("Test Data Generator released.\n");
}

struct file_operations test_fops = {
   NULL,          /* test_seek */
   test_read,     /* test_read */
   NULL,          /* test_write */
   NULL,          /* test_readdir */
   NULL,          /* test_select */
   NULL,          /* test_ioctl */
   NULL,          /* test_mmap */
   test_open,     /* test_open */
   test_release   /* test_release */
};

unsigned long test_init(unsigned long kmem_start)
{
   printk("Test Data Generator installed.\n");
   /* no semicolon after the if: the message is printed only on failure */
   if (register_chrdev(21,"test",&test_fops))
      printk("Test Data Generator error: Cannot register to major device 21!\n");
   return kmem_start;
}

========================================
|
|
|
|
OK, let's go over this. Look first at the test_init function. Notice
the new function -- register_chrdev. This registers the character device
with the kernel as using major device number 21. All devices (except for
the really simple one in the last chapter) use major device numbers to
be accessed. The kernel has an internal table of devices and their
associated device functions which is indexed by major device number.

The device numbers go from 0 to MAX_CHRDEV-1. MAX_CHRDEV is defined in
linux/fs.h, and is currently set at 32. In general, you want to stay away
from devices 0-15 because those are reserved for the "usual" devices.
Currently, these usual devices (according to the FAQ) are as follows:

---Excerpt from FAQ begins---

QUESTION: What are the device minor/major numbers?

The Linux Device List
maintained by rick@ee.uwm.edu (Rick Miller, Linux Device Registrar)
February 17, 1993

Many thanks to richard@stat.tamu.edu, Jim Winstead Jr., and many others.

Majors:
    0.  Unnamed . (unknown) .... for proc-fs, NFS clients, etc.
    1.  Memory .. (character) .. ram, mem, kmem, null, port, zero
    2.  Floppy .. (block) ...... fd[0-1]<[dhDH]{360,720,1200,1440}>
    3.  Hard Disk (block) ...... hd[a-b]<[0-8]>
    4.  Tty ..... (character) .. {p,t}ty<{S,[p-s][0-f]}><#>
    5.  tty ..... (character) .. tty, cua[0-63]
    6.  Lp ...... (character) .. lp[0-2] or par[0-2]
    7.  Tape .... (block) ...... t[0-?] (reserved for Non-SCSI tape drives)
    8.  Scsi Disk (block) ...... sd[a-h]<[0-8]>
    9.  Scsi Tape (block) ...... <n>rmt[0-1]
    10. Mouse ... (character) .. bm, psaux (mouse)
    11. CD-ROM .. (block) ...... scd[0-1]
    12. QIC-tape? (character) .. rmt{8,16}, tape<{-d,-reset}>
    13. XT-disk . (block) ...... xd[a-b]<[0-8]>
    14. Audio ... (character) .. audio, dsp, midi, mixer, sequencer

---Excerpt from FAQ ends---

The FAQ goes on to break down the major devices by minor numbers. Each major
device can be broken down into at most 256 minor devices (0-255). The
device driver can determine which minor it is supposed to operate on. More
on that later.
|
|
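As a rough illustration of how a major/minor pair can share one device
number, here is a user-space sketch. It assumes the 16-bit layout used by
kernels of this era (major in the high byte, minor in the low byte). The
names dev_num, make_dev, dev_major and dev_minor are made up for this
example; they are not the kernel's own macros.

```c
#include <assert.h>

/* Hypothetical helpers, assuming an 8-bit major and an 8-bit minor. */
typedef unsigned short dev_num;

/* Pack a major/minor pair into one device number. */
static dev_num make_dev(unsigned major, unsigned minor)
{
    assert(major < 256 && minor < 256);
    return (dev_num)((major << 8) | minor);
}

/* Unpack the two halves again. */
static unsigned dev_major(dev_num dev) { return dev >> 8; }
static unsigned dev_minor(dev_num dev) { return dev & 0xff; }
```

With this layout, the limit of 256 minors per major falls straight out of
the width of the low byte.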
|
|
In any case, I've chosen major 21 for experimentation purposes. By the way,
the name of the driver (here it's "test") is not important. The kernel does
not do anything with it. [It would be nice if it would. Then you could
interrogate the kernel and find out what drivers are installed!]

register_chrdev also takes a pointer to a file_operations structure. This
structure tells the kernel which function to call for which kernel operation.
The details of this structure are given in the Guide. For now, what is
important is that we are telling the kernel to call test_read for read
operations, test_open for open operations, and test_release for release
operations.

If a driver has already taken major 21, register_chrdev will return -EBUSY.
Here, all we do is print a message saying that 21 is already taken.
|
|
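To make the idea of a table indexed by major number concrete, here is a
user-space sketch of the bookkeeping. It is only a model under stated
assumptions, not the kernel's code: sketch_fops, sketch_table and
sketch_register are invented names, and only the table size (32) and the
-EBUSY result come from the text above. The -EINVAL result for an
out-of-range major is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MAX_CHRDEV 32           /* table size quoted in the text */

struct sketch_fops { int unused; };    /* stand-in for struct file_operations */

/* One slot per major number; NULL means the major is free. */
static const struct sketch_fops *sketch_table[SKETCH_MAX_CHRDEV];

/* Returns 0 on success, -EINVAL for a bad major, -EBUSY if already taken. */
static int sketch_register(unsigned major, const struct sketch_fops *fops)
{
    assert(fops != NULL);
    if (major >= SKETCH_MAX_CHRDEV)
        return -EINVAL;
    if (sketch_table[major] != NULL)
        return -EBUSY;
    sketch_table[major] = fops;
    return 0;
}
```

Registering the same major twice fails the second time, which is the case
the driver's error message is meant to report.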
|
|
Now, the test_open and test_release functions just print out things to
the console. They are really there for debugging purposes, so that you
can see when things happen.

The meat of the driver is the test_read function. The first thing it does
is print out how many bytes were requested. Then it puts that many bytes
into user space. Remember that the driver is executing at the kernel level,
and the user space will be different from kernel space. We have to do
some kind of translation to put the data which is in kernel space into
the buffer which is in user space. We use here the put_fs_byte function.

The loop puts the string into the buffer, going back to the beginning of
the string if necessary. Once the loop is finished, we just return the
actual number of bytes read. The actual number may be different from the
requested number. For example, you may be reading from the driver some kind
of message which has a fixed size. You may want to code the driver so that
if you attempt to read more than the message size, you will get only the
message size, and no more. Here, we just give the process however many
bytes it wants.
|
|
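The wrap-around in the read loop can be tried outside the kernel. The
sketch below is a user-space stand-in: fill_cyclic is a made-up name, and
a plain array store takes the place of put_fs_byte, so it only shows the
indexing, not the kernel-to-user copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Fill dst with count bytes taken cyclically from msg (len bytes, len > 0).
 * Returns the number of bytes stored, or -EINVAL if count is not positive. */
static int fill_cyclic(char *dst, int count, const char *msg, size_t len)
{
    int offset;

    assert(dst != NULL && msg != NULL && len > 0);
    if (count <= 0)
        return -EINVAL;
    for (offset = 0; offset < count; offset++)
        dst[offset] = msg[(size_t)offset % len];   /* wrap at end of msg */
    return offset;
}
```

Note that the driver divides by sizeof(test_data)-1 rather than
sizeof(test_data), so the terminating NUL of the message is never copied.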
|
|
Now, let's get this driver into the kernel. But first what we'll do is
create a special file which can be opened, read, and closed. Operations on
this special file will activate your driver code.

The special files are normally stored in the /dev directory. Do this:

    mknod /dev/testdata c 21 0
    chmod 0666 /dev/testdata

This makes a special character (c) file called testdata, and gives it major
21, minor 0. The chmod makes sure that everyone can read and write the
device.

Now recompile the kernel, and reboot. Once again, make sure you fix any
warnings or errors in your testdata.c compilation.

Now, go to the /tmp directory (or wherever you want), and write this
program:
|
|
|
|
========================================
File Listing 3: data.c
========================================

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
   int fd;
   int n;
   char buff[128];

   fd = open("/dev/testdata",O_RDWR);
   printf("/dev/testdata opened, fd=%d\n",fd);
   if (fd<0) exit(1);       /* open returns -1 on failure */
   printf("sizeof(buff)=%d\n",(int)sizeof(buff));
   n = read(fd,buff,sizeof(buff));
   printf("Read returns %d\n",n);
   if (n<0) exit(1);        /* nothing useful in buff */
   buff[127]=0;             /* make sure the string is terminated */
   printf("buff=\n'%s'\n",buff);
   close(fd);
   return 0;
}

========================================
|
|
|
|
Compile it using gcc. Run it. If it said "Linux is really funky!" lots
of times, pat yourself on the back (or wherever you want) for a job
well done. If it didn't, check the output, and see where you went wrong.
It could just be that you have a bad or old kernel.

The last line may be partial, since you're only printing out 127 characters.

++++++++++++++++++++++
EXPERIMENT 1
++++++++++++++++++++++

Use mknod to make another special file, this one with minor 1. Call it
something like /dev/testdata2. Change the device driver so that in the
read call, it finds out which minor is being read from. Use this:

   int minor = MINOR(inode->i_rdev);

Print out the minor number, and depending on which minor it is, read
from a different message string. Test your driver with code similar to
data.c.

++++++++++++++++++++++
|
|
|
|
------------------------------------------------------------------------

CHAPTER FOUR: You've learned to read, now you're gonna learn to write.
------------

Now that you're reading strings, you may want to write strings and read them
back. We'll go through two versions of this -- one that uses static memory,
and one that dynamically allocates the memory.

Keeping your current driver, all you need to do is add a write function to
it, not forgetting to put that write function into the file_operations
structure of the driver.

Add this section of code to your driver above the file_operations structure
declaration:
|
|
|
|
========================================
File Listing 4 (partial): testdata.c
========================================

static char test_data[128]="\0";
static int test_data_size=0;

static int test_write(struct inode * inode, struct file * file,
                      char * buffer, int count)
{
   printk("Write %d bytes\n",count);
   if (count>127) return -ENOMEM;     /* must leave room for the NUL */
   if (count<=0) return -EINVAL;
   memcpy_fromfs((void *)test_data, (void *)buffer, (unsigned long)count);
   test_data[count]=0; /* NUL-terminate the string */
   test_data_size = count;
   return count;
}

========================================
|
|
|
|
Also, alter the test_read function so that instead of using sizeof(test_data)
as the size of the test_data string, it uses test_data_size. Have test_read
return 0 while test_data_size is still 0 (nothing has been written yet),
since otherwise the modulo in the loop would divide by zero.

In the test_write function, I have decided to prevent the acceptance of
strings which are too big to fit (with a NUL-terminator) into the test_data
area, rather than just writing only what fits. In this case, if the offered
string is too long, I return -ENOMEM. The write function in the user's
process will return -1, and errno will be set to ENOMEM.
|
|
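The all-or-nothing policy described above can be checked in user space.
This is only a sketch under assumptions: store, store_size and store_write
are invented names, and memcpy stands in for memcpy_fromfs, so nothing here
crosses the user/kernel boundary.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define STORE_CAP 128                 /* same size as test_data in listing 4 */

static char store[STORE_CAP];
static int store_size = 0;

/* Accept src only if it fits with a trailing NUL; otherwise change nothing. */
static int store_write(const char *src, int count)
{
    assert(src != NULL);
    if (count > STORE_CAP - 1)
        return -ENOMEM;               /* too long: reject the whole string */
    if (count <= 0)
        return -EINVAL;
    memcpy(store, src, (size_t)count);   /* stand-in for memcpy_fromfs */
    store[count] = '\0';
    store_size = count;
    return count;
}
```

A rejected write leaves the previously stored string and its size
untouched, which is the point of checking the length before copying.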
|
|
Also note that I have used the memcpy_fromfs function, which is real
convenient -- much more convenient than looping a put_fs_byte.

Compile this driver, and test it by modifying data.c to write some data,
then read it back.

++++++++++++++++++++++
EXPERIMENT 2
++++++++++++++++++++++

Re-write the driver so that it can have two different strings for the two
minor devices as in experiment 1.

++++++++++++++++++++++

Now that we can write data to the driver, it would be nice if we could
dynamically allocate memory to store a string in. We will use kmalloc to
do this. (Why is discussed later.)

One thing which must be realized with kmalloc -- it can only allocate a
maximum of one Linux page (4096 bytes). If you want more, you will have
to create a linked list.

Change your driver so that instead of listing 4, you have this:
|
|
|
|
========================================
File Listing 5 (partial): testdata.c
========================================

static char *test_data=NULL;
static int test_data_size=0;

static int test_write(struct inode * inode, struct file * file,
                      char * buffer, int count)
{
   printk("Write %d bytes\n",count);
   if (count>4095) return -ENOMEM;    /* count+1 must fit in one page */
   if (count<=0) return -EINVAL;
   /* the old buffer was allocated with one extra byte for the NUL */
   if (test_data!=NULL) kfree_s((void *)test_data, test_data_size+1);
   test_data_size = 0;
   test_data = (char *)kmalloc((unsigned int)count+1, GFP_KERNEL);
   if (test_data==NULL) return -ENOMEM;
   memcpy_fromfs((void *)test_data, (void *)buffer, (unsigned long)count);
   test_data[count]=0; /* NUL-terminate the string */
   test_data_size = count;
   return count;
}

========================================
|
|
|
|
Here, instead of statically allocating memory for the string, we dynamically
allocate it using kmalloc. Note first, that if we had already allocated
a string, we free it first by using kfree_s. This is faster than using
kfree, because kfree would have to search for the size of the object allocated.
Here we know what the size was, so we can use kfree_s. kmalloc vs. malloc
is discussed below.

Next, note that we use the GFP_KERNEL priority in the kmalloc. This causes
the process to go to sleep if there is no memory available, and the process
will wake up again when there is memory to spare. In general, the process
will sleep until a page of memory is swapped out to disk.

In the event of catastrophic memory non-availability, kmalloc will return
NULL, and we should handle that case. Unfortunately here, we have already
freed the previous string -- although that could be changed easily by
kmallocing, then kfreeing.
|
|
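The "kmalloc first, then kfree" ordering mentioned above might look like
the following user-space sketch. It rests on assumptions: malloc and free
stand in for kmalloc and kfree_s, and kept, kept_size and replace_kept are
names invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static char *kept = NULL;     /* plays the role of test_data */
static int kept_size = 0;     /* plays the role of test_data_size */

/* Replace the kept string; if allocation fails, the old one survives. */
static int replace_kept(const char *src, int count)
{
    char *fresh;

    assert(src != NULL);
    if (count <= 0)
        return -EINVAL;
    fresh = malloc((size_t)count + 1);   /* stand-in for kmalloc */
    if (fresh == NULL)
        return -ENOMEM;                  /* old data is still intact */
    memcpy(fresh, src, (size_t)count);
    fresh[count] = '\0';
    free(kept);                          /* stand-in for kfree_s */
    kept = fresh;
    kept_size = count;
    return count;
}
```

The cost of this ordering is that both buffers exist for a moment, so the
driver briefly needs memory for the old and the new string at once.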
|
|
The rest of the code reads as in listing 4.

When we get into the section on interrupt handling, we will discuss the
use of GFP_ATOMIC as a kmalloc priority.

A brief excursion into kmalloc vs. malloc:

The malloc() call allocates memory in user space, which is fine if that's what
you want. Here, we want to have the driver store information so that *any*
process can use it, and so we have to allocate memory in the kernel. That
means kmalloc(). Further, there is a maximum of 4096 bytes which can be
allocated in any one call of kmalloc. This means that you cannot be guaranteed
to get contiguous space of over 4096 bytes. You will have to use a linked
list of kmalloced buffers.
|
|
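One way to picture a linked list of kmalloced buffers is the user-space
sketch below. Everything in it is an assumption made for illustration:
struct chunk and the chain_* functions are invented names, malloc and free
stand in for kmalloc and kfree, and each node is sized so that the whole
structure (link, length, data) stays within 4096 bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One node per allocation; sized so the whole node fits in 4096 bytes. */
struct chunk {
    struct chunk *next;
    int used;                                   /* bytes valid in data[] */
    char data[4096 - sizeof(struct chunk *) - sizeof(int)];
};

static void chain_free(struct chunk *head)
{
    while (head != NULL) {
        struct chunk *next = head->next;
        free(head);                             /* stand-in for kfree */
        head = next;
    }
}

/* Copy len bytes from src into a chain of chunks; NULL on failure. */
static struct chunk *chain_store(const char *src, size_t len)
{
    struct chunk *head = NULL, **tail = &head;

    assert(src != NULL);
    while (len > 0) {
        struct chunk *c = malloc(sizeof *c);    /* stand-in for kmalloc */
        size_t n = len < sizeof c->data ? len : sizeof c->data;

        if (c == NULL) {
            chain_free(head);                   /* undo the partial chain */
            return NULL;
        }
        c->next = NULL;
        c->used = (int)n;
        memcpy(c->data, src, n);
        *tail = c;                              /* append at the end */
        tail = &c->next;
        src += n;
        len -= n;
    }
    return head;
}

/* Total number of bytes held in the chain. */
static size_t chain_total(const struct chunk *head)
{
    size_t total = 0;

    for (; head != NULL; head = head->next)
        total += (size_t)head->used;
    return total;
}
```

A read function would then walk the chain node by node instead of indexing
one flat array.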
|
|
Alternatively, you can fool with the init section of the driver, and reserve
contiguous space for yourself on init (but then it may as well be statically
allocated).
|
|
|
|
------------------------------------------------------------------------

CHAPTER FIVE: For my next trick, I...fall....a...sleep (SNNXXXX!!)
------------

The thing which really saves multitasking operating systems is that
many processes sleep when waiting for events to occur. If this were not
true, processes would always be burning cycles, and there would really
be no big difference between running your processes at the same time,
or one after the other.

But when a process sleeps, other processes get to use the CPU. In general,
processes sleep when an event they are waiting for has not yet happened. The
exception to this is processes which are designed to do work when nothing
is happening. For example, you might have a process sitting around using
cycles to calculate pi out to a zillion digits. That kind of background
process should have its priority set real low so that it isn't executed
often when other (presumably more important) processes have work to do.

Since processes sleep when waiting for events, and said events are usually
handled by drivers, drivers must cause the processes which called them to
sleep if not ready. This is the idea behind the select() call, which will
be dealt with in a later chapter.

To illustrate sleeping and waking processes, we will alter our driver from
listing 2 by adding a new write function and changing the read function
around as follows:
|
|
|
|
========================================
File Listing 6 (partial): testdata.c
========================================

static char test_data[]="Linux is really funky!\n";
static int wakeups = 0;
static struct wait_queue *wait_queue = NULL;

static int test_write(struct inode * inode, struct file * file,
                      char * buffer, int count)
{
   printk("Write %d bytes\n",count);
   wake_up_interruptible(&wait_queue);
   printk("Woke %d processes.\n",wakeups);
   wakeups = 0;
   return count;
}

static int test_read(struct inode * inode, struct file * file,
                     char * buffer, int count)
{
   int offset;

   printk("Test Data Generator, reading %d bytes\n",count);

   printk("Process going to sleep\n");
   wakeups++;
   interruptible_sleep_on(&wait_queue);
   printk("Process has woken up!\n");

   for (offset=0; offset<count; offset++)
      put_fs_byte(test_data[offset % (sizeof(test_data)-1)], buffer+offset);
   return offset;
}

========================================
|
|
|
|
Don't forget to put the test_write function in the file_operations struct!
But don't compile this driver just yet! Read on...

The operation of this driver is as follows: On a read, put the process
to sleep. On a write, wake up all those processes which have gone to
sleep in this driver. This will allow the processes to complete the read.

There are two new variables here, wakeups and wait_queue. The wait_queue is
a circular queue of processes which are sleeping. It is FIFO, so that the
process woken up is the first process which went to sleep.

The kernel handles the queue for us; all we need to do is supply a pointer
to the queue and initialize it to NULL (i.e., the queue is empty).

We'll use the wakeups variable to count the processes currently asleep on the
wait_queue, which is the number of processes the next write will wake up.
Each time a process goes to sleep, we increment wakeups. When a write request
comes in, we wake up all the sleepers, report the count, and reset wakeups
to zero.

Simple, yes? Now we get into the sticky part.
|
|
|
|
In the Guide, you see that you can choose two ways of sleeping -- interruptible
or not. Interruptible sleeps can be interrupted (i.e., the process is woken
up) by signals (such as SIGUSR1) and hardware interrupts. Non-interruptible
sleeps can only be interrupted by hardware interrupts. Not even a kill -9
will wake up a non-interruptible process which is sleeping! Suppose you have a
signal handler in your process which will react to SIGUSR1.
That is, you can do kill -USR1 <pid>. What happens?

When the scheduler gets around to checking the signalled process for
runnability, it sees that there is a signal pending. This allows the process
to continue to run where it left off, with a twist: when the process leaves
kernel mode (the driver call) and enters user mode, the signal handler is
called (if there is one). Once the signal handler function exits, one of
three things can happen:

(1) If the original system call exited with -ERESTARTNOINTR,
    then the process will continue as if it calls the system call again
    with the same arguments.
(2) If the original system call did not exit with -ERESTARTNOINTR,
    but with -ERESTARTNOHAND or -ERESTARTSYS, then the process will
    continue exiting from the system call with -1, errno EINTR.
(3) If the original system call did not exit with -ERESTARTNOINTR,
    -ERESTARTNOHAND, or -ERESTARTSYS, then the process will continue,
    exiting from the system call with whatever was returned.
|
|
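The three cases above amount to a small decision table, sketched here as
user-space C. It models only what the text describes for these patch
levels. The SK_ constants are placeholder values chosen for this example,
and enum outcome, after_handler and check_outcomes are invented names;
none of this is kernel code.

```c
#include <assert.h>

/* Placeholder values for this sketch only; the kernel defines its own. */
#define SK_ERESTARTSYS    512
#define SK_ERESTARTNOINTR 513
#define SK_ERESTARTNOHAND 514

enum outcome {
    RESTART_CALL,    /* case (1): system call is issued again */
    FAIL_EINTR,      /* case (2): caller sees -1, errno EINTR */
    PASS_THROUGH     /* case (3): caller sees the driver's return value */
};

/* What the process sees once its signal handler has returned. */
static enum outcome after_handler(int driver_ret)
{
    switch (driver_ret) {
    case -SK_ERESTARTNOINTR:
        return RESTART_CALL;
    case -SK_ERESTARTNOHAND:
    case -SK_ERESTARTSYS:
        return FAIL_EINTR;
    default:
        return PASS_THROUGH;
    }
}

/* Small usage example: one call per case from the list above. */
static void check_outcomes(void)
{
    assert(after_handler(-SK_ERESTARTNOINTR) == RESTART_CALL);
    assert(after_handler(-SK_ERESTARTSYS) == FAIL_EINTR);
    assert(after_handler(42) == PASS_THROUGH);
}
```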
|
|
You can see most of this (if you can read mutilated 80386 assembly) in
<src>/kernel/sys_call.S and <src>/kernel/signal.c. Although signal handling
has been considerably revamped for 0.99pl8, the basic sequence of operations
is intact across patch levels. -ERESTARTNOHAND is new in 0.99pl8.

This is important -- the driver call should not be completed except for
cleanup, since the kernel will return an error for you or redo the system
call.

When the process continues to run before calling the signal handler, it picks
up where it left off -- in the interruptible_sleep_on function. This function
takes the process off the wait_queue automatically (which is nice). But then
wakeups is not updated (which is not so nice). In that case, when a
subsequent write comes in, the number of sleeping processes reported will
be wrong!

[pulpit-pounding mode on]

Although for this driver ignoring this is not such a big deal, it is sloppy
programming for a driver. Driver code must be so perfect that it operates
like a well-oiled machine, with no slip-ups. One error -- one bit of code
that gets out of sync -- and you can at least annoy users and make them throw
up their hands in frustration, and at worst panic the kernel and make users
throw your code away in frustration! Also, there is nothing worse than
spending time debugging an application when the bug is in the driver, or
trying to code around a known driver flaw.

[pulpit-pounding mode off]

So how do we solve this out-of-sync problem?
|
|
|
|
Fact: ignoring interrupts, all processes are atomic when they are in the
|
|
kernel. That is, unless a process performs an operation which can sleep (like
|
|
the call to kmalloc we visited above), or a hardware interrupt comes in, the
|
|
flow of execution goes from entering the kernel to leaving the kernel, with no
|
|
time taken out to run anything else. This does not mean that the code in user
|
|
space gets to continue to run. If the process leaves the system call and
|
|
is not eligible to run, other processes may run and then later on the system
|
|
call appears to have returned to the process. More on that later.
|
|
|
|
That fact is good to know. It means that as long as we are sure upon entering
|
|
the test_write call that wakeups contains the correct number of sleeping
|
|
processes, test_write will work 100%. That is, unless a hardware interrupt
|
|
comes in which causes the driver to execute an interrupt handler, we are safe,
|
|
but here we have no such handler, and so we can ignore that for now. We will
|
|
deal with interrupts in a later chapter.
|
|
|
|
So we know that write doesn't really have to be changed. It's really the
|
|
read that we're concerned about. What we need to do is after we get out
|
|
of interruptible_sleep_on() we see if we were genuinely woken up through
|
|
a wakeup call, or if we were signalled. If we were signalled, then
|
|
we know that the write call wasn't the cause of the wakeup, and so we should
|
|
really decrement wakeups.
|
|
|
|
Now for some loose ends. Remember that upon signalling, the kernel only
|
|
flags the signal for the process, and sets the process to a runnable state.
|
|
That does not mean that it can run immediately. Another process may get
|
|
to run first, and that process may very well run the driver's write code,
|
|
waking up all processes. Of course, we can consider the signalled process
|
|
to be still asleep when it gets the signal, because it has not yet run its
|
|
signal handler. So when that other process gets to run the write code, the
|
|
number of sleeping processes is indeed correct, and wakeups is set to 0.
|
|
|
|
But now, when the signalled process is run again, the read code will attempt
|
|
to decrement wakeups, making it -1! The next write will display the wrong
|
|
number of sleeping processes!
|
|
|
|
One thing saves us -- the fact that we can detect in the read code that the
|
|
write code was executed, simply because wakeups is 0. Remember that wakeups
|
|
is incremented before the sleep, so it is guaranteed to be greater than 0 if
|
|
the write code was not executed before waking up because of a signal.
|
|
|
|
So if the write code was executed, it really does not make sense to decrement
|
|
wakeups, so we just say that only if wakeups is non-zero do we decrement.
|
|
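The bookkeeping rule above can be tried out in user space before touching
the driver. What follows is only a model, not kernel code: the function
names are made up for this sketch, and I am assuming (from the description
above) that the write code reports the current count and then sets wakeups
to 0.

```c
#include <assert.h>

/* User-space model of the driver's counter.  Only the name `wakeups`
 * comes from the driver; everything else is hypothetical. */
static int wakeups = 0;         /* processes believed to be asleep in read */

/* Read side, just before the call to interruptible_sleep_on(). */
static void model_read_sleeps(void)
{
  wakeups++;
}

/* Write side: report how many sleepers there were, then wake them all. */
static int model_write(void)
{
  int sleepers = wakeups;

  wakeups = 0;
  return sleepers;
}

/* Read side, after the sleep, when the wakeup was caused by a signal. */
static void model_read_signalled(void)
{
  if (wakeups)                  /* 0 means a write got in before us */
    wakeups--;
  assert(wakeups >= 0);         /* the count must never go negative */
}
```

Calling model_read_sleeps(), then model_write(), then model_read_signalled()
replays the nasty ordering described above: the late reader finds wakeups
already at 0 and leaves it alone, so the next write still reports the right
number.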
|
|
To implement all this, add this code after the sleep:

========================================
File Listing 7 (partial): testdata.c
========================================

  if (current->signal & ~current->blocked)  /* signalled? */
  {
    printk("Process signalled.\n");
    if (wakeups) wakeups--;         /* Skip if a write already zeroed it */
    return -ERESTARTNOINTR;         /* Will restart call automagically */
  }

========================================



Now that you've got that straightened out, let's add some more confusion to
the mix. Suppose you're in the driver call, doing nice things, and then
all of a sudden a nasty timer interrupt (task switch possibility) comes in.
What now? Will there be a task switch? No. A RUNNING task in the kernel
cannot be switched out, otherwise all hell would break loose. Whew!
I'm glad we don't have to pay attention to that!

Well, now that we've gone through all the possible ways signals can make
your insides twist, you can code the driver. Remember to put listing 7 into
listing 6!

Here's how we're going to test this driver. Several processes will call
read (and sleep). When they wake up, they're going to say that they were
woken up (as opposed to printing out what they just read -- we already know
that works). One process will do a write to wake the other processes up.
This is the trigger process. Here is the code for the two types of processes:
|
|
|
|
========================================
File Listing 8: data.c
========================================

#include <stdio.h>
#include <stdlib.h>     /* For exit */
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <signal.h>

/* The reader process */

void signal_handler(int x)
{
  printf("Called signal handler\n");
  signal(SIGUSR1, signal_handler);  /* Reset signal handler */
}

int main(void)
{
  int fd;
  char buff[128];
  int rtn;

  signal(SIGUSR1, signal_handler);  /* Setup signal handler */

  fd = open("/dev/testdata",O_RDWR);
  printf("/dev/testdata opened, fd=%d\n",fd);
  if (fd<0) exit(1);                /* open failed */

  rtn = read(fd,buff,sizeof(buff)); /* Sleeps in the driver */
  printf("Read returns %d\n",rtn);
  if (rtn<0)
  {
    perror("read");
    exit(1);
  }
  printf("Process woken up!\n");
  close(fd);
  return 0;
}

========================================

========================================
File Listing 9: trigger.c
========================================

#include <stdio.h>
#include <stdlib.h>     /* For exit and atoi */
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <signal.h>

/* The writer process */

int main(int argc, char **argv)
{
  int fd;
  char buff[128] = "";              /* Don't write uninitialized bytes */
  int rtn;

  fd = open("/dev/testdata",O_RDWR);
  printf("/dev/testdata opened, fd=%d\n",fd);
  if (fd<0) exit(1);                /* open failed */

  if (argc>1)                       /* Given a pid: signal, don't write */
  {
    kill(atoi(argv[1]),SIGUSR1);
    exit(0);
  }
  rtn = write(fd,buff,sizeof(buff)); /* Wakes up the sleeping readers */
  if (rtn<0)
  {
    perror("write");
    exit(1);
  }
  close(fd);
  return 0;
}

========================================
|
|
|
|
Compile these programs using gcc. Now run two or three of the data processes:

    data &

The last message you should see for each of these processes is

    Process going to sleep.

because all of these processes are asleep. Now run the trigger program:

    trigger

This should wake up all the other processes, which should say,

    Process woken up!

Had the read function returned an error (like EINTR), they would have said

    read: <error text>

Now, let's test to see if the signal detection and restart mechanism works.
Run a single data process in the background via "data &". Remember its
pid. Now, run the trigger process with that pid as an argument:

    trigger <pid>

This will signal <pid> instead of waking it up via write. You should see
the driver, and then the process's signal handler, say,

    Process signalled.
    Called signal handler

but the process should not return from read, since we restarted the call.
Only a write will complete the call.
|
|
|
|
++++++++++++++++++++++
EXPERIMENT 3
++++++++++++++++++++++

Re-write the driver so that instead of always restarting the call, it returns
with EINTR on signal when the read call's count is a special value or values
(say anything less than 1000). Test to see if the read call returns EINTR when
the trigger program signals the reading process.

++++++++++++++++++++++

------------------------------------------------------------------------

CHAPTER SIX: I want this, that, that...no, THIS, and that. Or, selects!
-----------

The select call is one of the most useful calls created for interfacing to
drivers. Without it, or a function like it, if you wanted to check a
driver for readiness, you would have to poll it regularly. Worse, you
would not be able to check multiple drivers for readiness at the same time!

But enough of this. You have select, so rejoice and be happy.

As already implied by the first paragraph, the select system call allows
a process to check multiple drivers for readiness. For example, suppose
you wanted the process to sit around and wait for one of two file
descriptors to be ready for reading. Usually, if a descriptor is not ready
for reading and you read it, it will put your process to sleep (or "block").
But you can only read one file descriptor at a time, and here you want to
essentially block on _two_ fd's.
|
|
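From the application's point of view, blocking on two fd's at once might
look something like the sketch below. The helper name wait_for_either is
made up for illustration, and the five-argument form of select shown here
is the standard POSIX one.

```c
#include <assert.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sleep until fd_a or fd_b is ready for reading.
 * Returns the ready descriptor (fd_a wins a tie), or -1 on error. */
static int wait_for_either(int fd_a, int fd_b)
{
  fd_set read_fds;
  int nfds = (fd_a > fd_b ? fd_a : fd_b) + 1;   /* highest fd plus one */

  assert(fd_a >= 0 && fd_a < FD_SETSIZE);
  assert(fd_b >= 0 && fd_b < FD_SETSIZE);

  FD_ZERO(&read_fds);
  FD_SET(fd_a, &read_fds);
  FD_SET(fd_b, &read_fds);

  /* A NULL timeout means: block until something is ready. */
  if (select(nfds, &read_fds, NULL, NULL, NULL) < 0)
    return -1;

  /* On return, the set holds only the descriptors that are ready. */
  if (FD_ISSET(fd_a, &read_fds)) return fd_a;
  if (FD_ISSET(fd_b, &read_fds)) return fd_b;
  return -1;
}
```

The process sleeps inside select until either descriptor has data, and a
read on the descriptor that comes back will then not block.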
|
|
In that case, you use the select call. The syntax of select was already
explained in the Guide, so let's go about implementing a select function in
our driver.

Add the following code to the driver, and put the test_select function in
the fops structure:

========================================
File Listing 10 (partial): testdata.c
========================================

static int test_select(struct inode *inode, struct file *file,
                       int sel_type, select_table *wait)
{
  printk("Driver entering select.\n");
  if (sel_type==SEL_IN)         /* ready for read? */
  {
    if (wakeups)                /* Some process is sleeping in read */
    {
      select_wait(&wait_queue, wait);
      printk("Driver not ready\n");
      return 0;                 /* Not ready yet */
    }
    return 1;                   /* Ready */
  }
  return 1;                     /* Always ready for writes and exceptions */
}

========================================

Here's what this function does. When a process issues a select call with
this driver's fd among the fd's to be checked for reading, the kernel will
call test_select with sel_type being SEL_IN. If wakeups is non-zero (that is,
processes have read without a process writing) then we will say that the
driver is not ready for reading. In this case, select_wait will add the
process to the wait_queue and immediately return. The return of 0 indicates
that the driver is not ready for the operation.

For any other type of operation (or if there are no processes sleeping in
read) we say the driver is ready (return 1).

The only thing that must be remembered is that we are using the same
wait_queue structure for processes sleeping in read and processes sleeping
in select. This means that writing to the driver will wake up both types of
processes. If desired, a different wait_queue could be used, and the
appropriate wake up code would have to be written.

Compile this new code into the kernel. We will test this driver by writing
a new type of process which will call the select system call. Here is the
new process' code:
|
|
|
|
========================================
File Listing 11: sel.c
========================================

#include <stdio.h>      /* Doesn't hurt, can only help! */
#include <stdlib.h>     /* For exit */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>   /* For FD_* and select */
#include <unistd.h>

int main(void)
{
  int fd;
  int rtn;
  fd_set read_fds;

  fd = open("/dev/testdata", O_RDWR);
  printf("/dev/testdata opened, fd=%d\n",fd);
  if (fd<0) exit(1);                    /* open failed */
  printf("Entering select...\n");
  FD_ZERO(&read_fds);
  FD_SET(fd,&read_fds);
  /* First argument is the highest fd plus one; NULL timeout blocks */
  rtn = select(fd+1, &read_fds, NULL, NULL, NULL);
  if (rtn<0)
  {
    perror("select");
    exit(1);
  }
  printf("Select returns %d\n",rtn);
  return 0;
}

========================================
|
|
|
|
When the kernel is re-loaded, the first test we will perform is to see
whether the select call returns immediately given that no processes are
sleeping in read. Just run sel -- no need to run it in the background.
You should see something like:

    Entering select...
    Driver entering select.
    Select returns 1

This is as it should be -- select has determined that one file descriptor is
ready for reading.

Our next test is to see whether select sleeps properly. Run this:

    data &
    sel &
    trigger

When sel is run, you should see:

    Entering select...
    Driver entering select.
    Driver not ready
    Driver entering select.
    Driver not ready

The select call in the kernel calls the test_select function a second time
if the driver was not ready the first time. However, the process is only
added to the wait queue once -- the first time.

Once the trigger program is run, you should see:

    Process has woken up!
    Read returns 1024
    Driver entering select.
    Select returns 1

That is, the data process woke up due to the write, as did the sel process.
Note that the test_select function is called once again when the sel process
is woken up. This is also a consequence of the kernel design, and is
nothing to worry about. Those who are interested in the inner workings of
the select call should look in the file <src>/fs/select.c.

A word about signals and select. Since the select function in the driver
does not return any error code -- just 0 or non-0 -- there is no way for the
driver to decide whether the select call should be restarted or not. The
select system call will return -1 with errno set to EINTR if interrupted by
a signal.
|
|
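Since the driver cannot ask for a restart here, an application that wants
select to keep waiting across signals has to retry by itself. A minimal
sketch of that, with a made-up helper name (wait_readable), could be:

```c
#include <assert.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: wait until fd is ready for reading, calling
 * select again whenever a signal interrupts it.  Returns the result
 * of the last select call. */
static int wait_readable(int fd)
{
  fd_set read_fds;
  int rtn;

  assert(fd >= 0 && fd < FD_SETSIZE);
  do
  {
    /* select may change the set, so rebuild it on every attempt */
    FD_ZERO(&read_fds);
    FD_SET(fd, &read_fds);
    rtn = select(fd+1, &read_fds, NULL, NULL, NULL);
  } while (rtn < 0 && errno == EINTR);
  return rtn;
}
```

Whether to retry or give up on EINTR is the application's decision; a
program that uses signals to cancel a wait would not loop like this.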
|
|
------------------------------------------------------------------------

CHAPTER SEVEN: This next chap -- oh, hello! -- this next chapter is about
-------------  interrupts.

This chapter will be one of the most difficult chapters to go through as a
tutorial, since some means of generating interrupts must be used to test
things with. Furthermore, the interrupt must be one which is currently
unused by the system, and one must be willing to mess around with a hardware
device which is connected to the IRQ.

I will start out with something more controlled than external interrupts --
internal, or software, interrupts.

Why internal interrupts? There really is not such a big difference between
internal and external interrupts. Certainly an IRQ is generated by a
hardware device, but the hardware IRQ results in a software interrupt. I
will discuss the required changes for dealing with hardware rather than
software interrupts later in this chapter.

Note: The following paragraphs deal with 80386/80486 specific stuff. Those
who are not really interested in the "why" of Linux interrupts may skip
ahead!

To be able to use interrupts, we must first understand how Linux handles
interrupts. Interrupts most often require a transfer of execution control
from one code segment to another, and this may be accomplished in two ways.
The first is by specifying the descriptor of the other executable segment,
and the second is by a "gate".

In Linux, three functions are used to initialize gates: set_intr_gate,
set_trap_gate, and set_system_gate.

set_intr_gate sets up a 32-bit interrupt gate with descriptor privilege
level (DPL) 0 (the most privileged level).

set_trap_gate sets up a 32-bit trap gate with DPL 0.

set_system_gate sets up a 32-bit trap gate with DPL 3.

Each of these calls enters the gate into the interrupt descriptor table
(IDT) so that when an INT n instruction is performed, control transfers
through the gate in the IDT corresponding to n.
|
|
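For the curious, a gate is just eight bytes in the IDT. The sketch below
packs one according to the 80386 gate descriptor layout (type 14 is a
32-bit interrupt gate, type 15 a 32-bit trap gate). It is an illustration
in plain user-space C, not the kernel's own code, and the names gate_desc
and make_gate are made up.

```c
#include <assert.h>
#include <stdint.h>

#define GATE_INTR 14    /* 32-bit interrupt gate */
#define GATE_TRAP 15    /* 32-bit trap gate */

struct gate_desc        /* hypothetical name for one IDT entry */
{
  uint32_t low;         /* selector (high half), handler bits 0-15 */
  uint32_t high;        /* handler bits 16-31, present bit, DPL, type */
};

/* Build the gate for a handler at address `handler`, reached through
 * the code segment `selector`. */
static struct gate_desc make_gate(uint32_t handler, uint16_t selector,
                                  unsigned type, unsigned dpl)
{
  struct gate_desc g;

  assert(dpl <= 3);
  assert(type == GATE_INTR || type == GATE_TRAP);

  g.low  = ((uint32_t)selector << 16) | (handler & 0xFFFFu);
  g.high = (handler & 0xFFFF0000u)
         | 0x8000u              /* present bit */
         | (dpl << 13)          /* descriptor privilege level */
         | (type << 8);         /* gate type */
  return g;
}
```

For a handler at 0x12345678 behind selector 0x0008, an interrupt gate with
DPL 0 comes out as low = 0x00085678, high = 0x12348E00. Asking for a trap
gate with DPL 3 only changes the middle bits, giving high = 0x1234EF00.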
|
|
THIS ENDS 80386/80486 DISCUSSION.

The three Linux calls allow us to install an interrupt handler for any
interrupt from 0x00 to 0xFF. We will use set_intr_gate to install an
interrupt handler into interrupt 0x90.
|
|
|