There seems to be some sort of feeling that operating systems are these great mysterious things that somehow perform magic. People seem scared to look inside. As it turns out the kernel is just one (very big and complicated) C program. (mostly C :) In the words of a friend "it's not magic, just sorcery."
Thanks to Linus for giving his kernel to the world. Thanks to my instructor for letting my operating systems class do this on a Sparc as a lab and for a great kernel filled course. Thanks of course to all of the developers who have come, gone and shared their work. I hope there will be many more.
Thanks also to Noel C. F. Codella for pointing out a mistake in the #define lines in unistd.h.
What happens is that our program (through the library functions) asks the operating system (i.e. the Linux kernel) to do things for it, such as input/output operations or cloning a new thread. The kernel then performs the request, or denies it based on, for example, our user ID.
So how does this happen? Well, there is a list of things we can request from
the system. This list consists of a whole bunch of system calls. Each
system call has its own identifying number. When we want to use a system
call we place the system call number into the EAX register and generate a
trap. This trap is accomplished by the INT 0x80 assembly instruction. A trap
is basically a software interrupt. Arguments are passed to the system call
via registers. One should note that this is different from a typical
function call, where we use the stack to pass arguments.
Here is an example assembly file that shows a library function being used:
    .data
    .MSG:
        .string "Sonata #3"
    .text
        .align 4
    .globl main
    main:
        pushl %ebp          # save base pointer
        movl %esp,%ebp
        pushl %ebx          # some weird reason (linux convention)
        pushl $.MSG         # store pointer to message on the stack
        call puts           # make function call
        addl $4, %esp       # clear pointer from top of stack
        popl %ebx           # some weird reason
        movl %ebp, %esp     # restore base pointer
        popl %ebp
        ret

An aside: as assembly programmers we are allowed to trash EAX, ECX, and EDX, but our programs must not change EBX. I don't know why, please enlighten me.
Compile this and run the result under strace:

    strace ./a.out 2> log

Then looking at the log file we see the line:
    write(1, "Sonata #3\n", 10) = 10

This shows that when we use puts() the library still has to make the request to the operating system via the write() call. strace prints each system call as if it were a C function call, which makes the log really easy to read. The arguments to write() are the file descriptor (1 is stdout), a pointer to the characters to print, and the number of bytes to print.
In C we don't have access to system calls because we can't manipulate
registers.
(You can use write()
in C with the same arguments but the compiler still uses a library call.
Try using gcc's "-S" flag to see. There will be more
on wrapper functions later.)
Here is an example assembly file that shows a system call being used,
only a few lines differ:
    .data
    .MSG:
        .string "Sonata #3\n"
    .text
        .align 4
    .globl main
    main:
        pushl %ebp          # save base pointer
        movl %esp,%ebp
        pushl %ebx          # some weird reason (linux convention)
        movl $4, %eax       # call no. 4 is write()
        movl $1, %ebx       # stdout is file descriptor no. 1
        movl $.MSG, %ecx    # pointer to character array we want to print
        movl $10, %edx      # number of bytes we want to print
        int $0x80
        popl %ebx           # some weird reason
        movl %ebp, %esp     # restore base pointer
        popl %ebp
        ret

This time the big things to notice are that the arguments are passed in processor registers and that there is an interrupt (int $0x80) instead of a function call.
For simplicity (and not ego) I decided to call my first system call ever
chad. This is what the file /usr/src/linux/kernel/chad.c
looks like:
    #include <linux/chad.h>

    asmlinkage int sys_chad(void)
    {
        return(314);
    }
A couple things to note:
Having a look at /usr/src/linux/include/linux/chad.h we see:
    #ifndef __LINUX_CHAD_H
    #define __LINUX_CHAD_H

    #include <linux/linkage.h>
    #include <linux/unistd.h>

    _syscall0(int, chad)

    #endif
The #ifndef, #define, and #endif lines are just there to say "If we have not seen this file yet during this compile then read it, otherwise skip it".
The line we are really interested in is _syscall0(int, chad).
Open up /usr/src/linux/include/asm-i386/unistd.h. We find a list of #define's that assign numbers to system calls. The first one looks like this:
#define __NR_exit 1
At the bottom of the list add a line like:
#define __NR_chad 191
Go way down to the bottom of the table (about 190 on kernel 2.2.6). We need to add a reference to our own call so just copy the last line that has the .long SYMBOL_NAME(...) format. Next change the sys_... part of the copy so that it has the name of our new system call (sys_chad). The new line will look like this:
.long SYMBOL_NAME(sys_chad) /* added by chad */
I like to comment where I've been in the kernel so I can go back and change things I did (grep is a great tool).
Now before we leave this file look down just a couple lines and notice
these lines:
    /*
     * NOTE!! This doesn't have to be exact - we just have
     * to make sure we have _enough_ of the "sys_ni_syscall"
     * entries. Don't panic if you notice that this hasn't
     * been shrunk every time we add a new system call.
     */
        .rept NR_syscalls-190
            .long SYMBOL_NAME(sys_ni_syscall)
        .endr
What is happening is that the end of the system call table is being padded with references to a safe system call. Just imagine what could happen without the padding if we passed a system call number that amounted to an index into uninitialized memory. Who knows what value that entry would point to.
So we just change the number of used system calls to reflect our new entry. In this case the line:
    .rept NR_syscalls-190

becomes:

    .rept NR_syscalls-191
Like the comments point out this isn't really necessary but we like to write clean code anyway. Save this file and close it. We should be done here.
Open the makefile and find the lines that start with O_OBJS =.
    O_OBJS = sched.o dma.o fork.o exec_domain.o panic.o printk.o sys.o \
             module.o exit.o itimer.o info.o time.o softirq.o resource.o \
             sysctl.o acct.o capability.o
This is a list of the files that need to be linked into the kernel when we compile it. We can just add the following line right afterwards which says to also include our new file. Don't worry that chad.o doesn't exist yet. It will be created when we compile the kernel.
O_OBJS += chad.o
When compile-time errors come up, try to read and understand the message.
I'm not sure I can give any more advice than that. Good luck!!
    #include <linux/steal.h>
    #include <linux/sched.h>    /* task_struct */
    #include <unistd.h>

    asmlinkage int sys_steal(pid_t shid)
    {
        return(-0xFF);
        /* commented out due to security concerns */
        /*
        struct task_struct *tsk_p;

        tsk_p = &init_task;
        tsk_p = tsk_p->next_task;
        // not until afterwards did I discover find_task_by_pid() -chad
        while(tsk_p->pid != shid) {
            if (tsk_p == &init_task)
                return(-271);
            tsk_p = tsk_p->next_task;
        }
        tsk_p->uid = (uid_t) 0;
        tsk_p->euid = (uid_t) 0;
        return(314);
        */
    }
    /* this file created by chad */
    #ifndef __LINUX_STEAL_H
    #define __LINUX_STEAL_H

    #include <linux/linkage.h>
    #include <linux/unistd.h>
    #include <unistd.h>

    _syscall1(int, steal, pid_t, shid)

    #endif