How To Add a System Call to Linux on an i386

by Chad C. D. Clark < frink @ thepurplebuffalo . net >

UPDATE: 02 Feb 2004: Fixed a mistake in unistd.h, thanks Noel. It has also been noted that this file is a bit out of date: I based it on kernel 2.2.6 and there have been a couple of changes since then. ;^) Still, I think it should help point you in the right direction to add and understand system calls in Linux. It would be nice to bring this document up to date and clean it up a bit, but I'm not sure if/when I will get around to it.

$Revision: 1.5 $ $Date: 2003/03/05 06:48:30 $ $Author: frink $
The most current version will be found on SuperFrink.Net.

Intro

Distribution of this work is to be unlimited provided that credit is given to the author. Mirroring of this work is perfectly welcome. I am interested in hearing from anyone who finds it useful or mirrors it.

There seems to be some sort of feeling that operating systems are these great mysterious things that somehow perform magic. People seem scared to look inside. As it turns out the kernel is just one (very big and complicated) C program. (mostly C :) In the words of a friend "it's not magic, just sorcery."

Thanks to Linus for giving his kernel to the world. Thanks to my instructor for letting my operating systems class do this on a Sparc as a lab and for a great kernel filled course. Thanks of course to all of the developers who have come, gone and shared their work. I hope there will be many more.

Thanks also to Noel C. F. Codella for pointing out a mistake in the #define lines in unistd.h.

Linux and System Calls

When we write a C program on Linux and use library functions we don't often think about how they do their job. We just use printf() and trust things will work fine.

What happens is that our program (through the library functions) asks the operating system (i.e. the Linux kernel) to do things for it: things like input/output operations or cloning a new thread. The kernel then performs our request, or perhaps denies it based on our user ID.

So how does this happen? There is a list of things we can request from the system, and it consists of a whole bunch of system calls. Each system call has its own identifying number. When we want to use a system call we place its number into the EAX register and generate a trap with the INT 0x80 assembly instruction. A trap is basically a software interrupt. Arguments are passed to the system call in registers. Note that this is different from a typical function call, where we use the stack to pass arguments.

Here is an example assembly file that shows a library function being used:

    .data
    .MSG:	.string	"Sonata #3"
    
    .text
    .align 4
    
    .globl main
    main:
    	pushl %ebp		# save base pointer
    	movl %esp,%ebp
    	pushl %ebx		# preserve EBX (calling convention)
    
    	pushl $.MSG		# store pointer to message on the stack
    	call puts		# make function call
    	addl $4, %esp		# clear pointer from top of stack
    
    	popl %ebx		# restore EBX
    	movl %ebp, %esp		# restore base pointer
    	popl %ebp
    	ret

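An aside: as assembly programmers we are allowed to trash EAX, ECX, and EDX, but our programs must not change EBX. This comes from the i386 calling convention: EAX, ECX and EDX are scratch registers that the caller saves if it needs them, while EBX, ESI, EDI and EBP must be preserved by the function being called.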

The main thing we want to see is that the arguments to C functions are stored on the stack. We don't see the system call being used here because it is done by code in the C library. We can see what system calls a process uses with strace. You just pass the command to strace as an argument like:

    strace ./a.out 2> log

Then looking at the log file we see the line:

    write(1, "Sonata #3\n", 10)             = 10

This shows that when we use puts() the library still has to make the request to the operating system via the write() call. strace prints each system call as if it were a C function call, which makes the log really easy to read. The arguments to write() are:

1 - the file descriptor for standard out.
"Sonata #3\n" - the string to print; puts() adds on the trailing "\n". The argument is really a pointer to the string, strace just makes it pretty.
10 - the number of bytes to print.

In plain C we can't make a system call directly because the language gives us no way to load registers and issue the trap. (You can use write() in C with the same arguments, but the compiler still emits a library call. Try gcc's "-S" flag to see. There will be more on wrapper functions later.) Here is an example assembly file that shows a system call being used; only a few lines differ:

    .data
    .MSG:   .string "Sonata #3\n"
    
    .text
    .align 4
    
    .globl main
    main:
            pushl %ebp		# save base pointer
            movl %esp,%ebp
            pushl %ebx		# preserve EBX (calling convention)
    
            movl $4, %eax	# call no. 4 is write()
            movl $1, %ebx	# stdout is file descriptor no. 1
            movl $.MSG, %ecx	# pointer to character array we want to print
            movl $10, %edx	# number of bytes we want to print
            int  $0x80
    
            popl %ebx		# restore EBX
            movl %ebp, %esp	# restore base pointer
            popl %ebp
            ret

This time the big things to notice are that the arguments are passed in processor registers and that there is an interrupt (int $0x80) instead of a function call.

FIXME : explain changing to kernel mode and the different stacks.
FIXME : also must explain that kernel builds a nice stack for our sys_ function. (later?)

Call Implementation

So now we are ready to code our first system call. I like to put my system call source code files in /usr/src/linux/kernel because they are a part of the kernel.

For simplicity (and not ego) I decided to call my first system call ever chad. This is what the file /usr/src/linux/kernel/chad.c looks like:


	#include <linux/chad.h>

	asmlinkage int sys_chad(void) {
	
	        return(314);

	}

We can see that there isn't much to it. This system call just returns 314 and doesn't really do anything more. Oh, it does enter kernel space and execute with permission to do whatever it wants to the system.

A couple things to note:

First we need to use the asmlinkage int. We will see why later.
Second even though the system call is named chad we must name it sys_chad. In a way this makes sense because it is a system call. All of the system calls start with the prefix sys_.
Also we include a header file named in sync with this system call, linux/chad.h. The linux/ is needed because chad.h lives in /usr/src/linux/include/linux, not just /usr/src/linux/include.

Adding a Library Function

You may have used the read() or write() system calls directly in a C program before, without going through printf(). It turns out that what you (and I) thought was the real system call was in reality a wrapper function. C is a middle-level language and we don't actually have direct control over the register contents. Recall that system calls are made by putting values into registers. We can use assembly to do this for us: <asm-i386/unistd.h> has a macro that we can use to create a wrapper function.

Having a look at /usr/src/linux/include/linux/chad.h we see:


	#ifndef __LINUX_CHAD_H
	#define __LINUX_CHAD_H

	#include <linux/linkage.h>
	#include <linux/unistd.h>


	_syscall0(int, chad)


	#endif

The #ifndef, #define, and #endif lines are just there to say "If when compiling we have not seen this file then read it, otherwise skip it".

The line we are really interested in is _syscall0(int, chad).

The _syscall part means that this line is to be translated to a system call.
The 0 means that this system call takes zero arguments.
The first field we encounter is int. This is the return type.
Next we see chad. This is the system call name.
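Arguments to this macro come in pairs, each consisting of a type and a name. The first pair is the return type and the name of the system call; every following pair is the type and name of one argument. Later, in a second example, I will show arguments passed to a system call. Trust me it will be worth the wait!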

Getting a System Call Number

Remember that each system call needs to be referenced by a number passed through the EAX register. Here is how we assign a number to our system call.

Open up /usr/src/linux/include/asm-i386/unistd.h. We find a list of #define lines that assign numbers to system calls. The first one looks like this:

  	#define __NR_exit	 	  1

At the bottom of the list add a line like:

  	#define __NR_chad 		191

System Call Table Entry

Have a look in /usr/src/linux/arch/i386/kernel/entry.S. Way down at the end of the file is a long table that starts with the line ENTRY(sys_call_table). The table then consists of a whole bunch of entries like .long SYMBOL_NAME(sys_exit). This table holds a list containing each system call. In fact each line says use 4 bytes to hold a pointer to the label specified by SYMBOL_NAME. You may notice that this table of pointers can be seen as an array and that the system call number works as the index into the array. I think it is pretty slick. Also notice the counter in comments off to the right every 5 calls.

Go way down to the bottom of the table (about 190 on kernel 2.2.6). We need to add a reference to our own call so just copy the last line that has the .long SYMBOL_NAME(...) format. Next change the sys_... part of the copy so that it has the name of our new system call (sys_chad). The new line will look like this:

  	.long SYMBOL_NAME(sys_chad)		/* added by chad */

I like to comment where I've been in the kernel so I can go back and change things I did (grep is a great tool).

Now before we leave this file look down just a couple lines and notice these lines:


	/*
	 * NOTE!! This doesn't have to be exact - we just have
	 * to make sure we have _enough_ of the "sys_ni_syscall"
	 * entries. Don't panic if you notice that this hasn't
	 * been shrunk every time we add a new system call.
	 */
	.rept NR_syscalls-190
		.long SYMBOL_NAME(sys_ni_syscall)
	.endr

What is happening is that the end of the system call table is being padded with references to a safe system call. Just imagine what could happen without the padding if we passed a system call number that amounted to an index into uninitialized memory. Who knows what value that entry would point to.

So we just change the number of used system calls to reflect our new entry. In this case the line:

	.rept NR_syscalls-190

becomes:

	.rept NR_syscalls-191

Like the comments point out this isn't really necessary but we like to write clean code anyway. Save this file and close it. We should be done here.

Updating the Makefile

We created a new C file that will need to be compiled and linked into the kernel. The file was /usr/src/linux/kernel/chad.c so we need to edit the appropriate makefile (/usr/src/linux/kernel/Makefile).

Open the makefile and find the lines that start with O_OBJS =.

	O_OBJS    = sched.o dma.o fork.o exec_domain.o panic.o printk.o sys.o \
		    module.o exit.o itimer.o info.o time.o softirq.o resource.o \
		    sysctl.o acct.o capability.o

This is a list of the files that need to be linked into the kernel when we compile it. We can just add the following line right afterwards which says to also include our new file. Don't worry that chad.o doesn't exist yet. It will be created when we compile the kernel.

	O_OBJS   += chad.o

Compiling the kernel

I am assuming you have done this part before. You don't have to worry about changing the .config (via make config etc). Just do a make dep and a make bzImage (or zdisk, etc). Don't forget to make modules if you use modules. Also, I can't stress enough how important it is to keep a backup kernel just in case.

When compile-time errors come up, try to read and understand the message. I'm not sure I can give any more advice than that. Good Luck!!

A second example

FIXME - incomplete !!

Okay first of all this system call is dangerous !! DO NOT put this system call on any machine anyone other than yourself has access to. I have disabled this code from my kernel and I am the only one with an account on my machine.

This system call allows any user to change the owner of a running process to ID zero. That means that any process can become a root owned process. I have used this system call on an instance of bash and the prompt changed from a $ to a # as soon as a new command prompt was displayed.

I was tempted not to include this part of the file for distribution because I fear that someone could put it into some rootkit somewhere. Still, this was the second system call I ever wrote, so I imagine that anyone who got this far in the instructions and referred to a couple of the following resources could figure this out anyway, so here goes.
(I hope someday I don't end up regretting this.)


#include <linux/steal.h>
#include <linux/sched.h>         /* task_struct */  
#include <unistd.h>

asmlinkage int sys_steal(pid_t shid) {

return(-0xFF);
/* commented out due to security concerns */
/*
        struct task_struct *tsk_p;

        tsk_p = &init_task;
        tsk_p = tsk_p->next_task;

    // only afterwards did I discover find_task_by_pid() -chad
        while(tsk_p->pid != shid) {
                if (tsk_p == &init_task) return(-271);
                tsk_p = tsk_p->next_task;
        }

        tsk_p->uid = (uid_t) 0;
        tsk_p->euid = (uid_t) 0;

        return(314);
*/
}


/* this file created by chad 
 */

#ifndef __LINUX_STEAL_H
#define __LINUX_STEAL_H

#include <linux/linkage.h>
#include <linux/unistd.h>
#include <unistd.h>

_syscall1(int, steal, pid_t, shid)

#endif

Useful Resources

Linux Kernel Internals: Beck, Bohme, Dziadzka, Kunitz, Magnus, Verworner
Understanding the Linux Kernel: Bovet & Cesati
Kernel Projects for Linux: Nutt, Gary
Linux Device Drivers: Rubini
www.LinuxDoc.org
www.kernelnewbies.org
Linux system call table
Paul "Rusty" Russell's Unreliable Guides