Linux UserSpace Signal Logging
(Logs signals sent from userspace programs.)
by Chad Clark (Version 10 March 2003)
The Beginning
I was dealing with a program that was dying from signal 6 (SIGABRT). I had
to stop the server from restarting every 3 to 10 minutes and wanted to trace
where the signals were coming from. A quick kernel patch and things were on
the way to being fixed, so I thought.
The Idea
What I did was log [ via printk() ] when a signal was sent by a user process.
This was accomplished by adding a printk() to sys_kill() [ see
kill in section two of the man pages ]. I wanted to log the process id and
name of both the sending and receiving processes. Sometimes the receiver
won't have a name because it is a process group or job so that had to be taken
care of as well.
In order to keep the logging specific and flexible I added a table to store
the list of signals which were to be logged [ logged_signals ].
This is a simple array indexed by signal number where a non-zero value means
for the signal to be logged. Then I added a system call so I could change
the signals that are logged.
The Code
- /usr/src/linux/kernel/signal.c was changed a bit to hold the
table and sys_kill was patched. Also I stashed the new system
call in there. The new code can be found HERE.
- /usr/src/linux/include/asm-i386/unistd.h had one line added
to it. HERE. This just gives the new system call
a number.
- /usr/src/linux/arch/i386/kernel/entry.S also got a new line.
See it HERE. This just adds a reference to the
system call.
The System Call
sys_siglog takes two arguments: a signal number and an action. The
action is either a zero to disable logging of that signal or a non-zero value
to enable logging of the passed in signal.
Testing
- Step 1: Setup syslogd.
Edit /etc/syslog.conf to send *.debug to
/var/log/debug. This just takes a line like:
*.debug /var/log/debug
- Step 2: The test program.
siglog.c is a quick and dirty C program that takes
two arguments on the command line. These arguments are just turned into
int's and then passed to the system call. There is no error
checking done on the arguments but sys_siglog isn't to picky.
Note you do have to be root to change which signals are logged.
Compile the test program with gcc siglog.c -o siglog
- Step 3: Testing.
Assuming a make dep && make bzImage ; echo -e "\a" and that the
new kernel boots and such then fire up a sleep 100 on one
virtual terminal and signal with kill -4 from a second tty.
/var/log/debug should have an entry like:
Apr 30 06:04:50 forbidden kernel: Sig: 4 from pid: 44 (bash) to pid: 61 (sleep)
A kill -1 66 (on a new sleep :) should generate something like:
Apr 30 06:05:42 forbidden kernel: Sig: 1 from pid: 44 (bash) to pid: 66 (sleep)
(By the way "forbidden" is the machine's hostname.)
Now mask out signal 4 via ./siglog 4 0 .
Then try kill -4 PID again and check the logfile
once more. You should not see an entry.
Lastly try kill -1 PID once more. This time you
should again see a log entry.
Now turn signal 4 logging back on via ./siglog 4 1 and
check to see that both signals 1 and 4 are logging entries.
The Output
Some typical entries including suspending a process and booting up
will look like:
Apr 30 07:56:05 forbidden kernel: Sig: 15 from pid: 44 (bash) to pid: 61 (a.out)
Apr 30 07:56:22 forbidden kernel: Sig: 18 from pid: 44 (bash) to pid: -62
Apr 30 07:57:08 forbidden kernel: Sig: 11 from pid: 44 (bash) to pid: -62
Apr 30 07:57:17 forbidden kernel: Sig: 15 from pid: 44 (bash) to pid: 63
Apr 30 07:57:20 forbidden kernel: Sig: 15 from pid: 44 (bash) to pid: 62 (sleep)
Apr 30 07:57:24 forbidden kernel: Sig: 18 from pid: 44 (bash) to pid: -62
Apr 30 08:00:36 forbidden kernel: Sig: 15 from pid: 88 (poff) to pid: 74 (pppd)
Apr 30 08:07:29 forbidden kernel: Sig: 19 from pid: 94 (vi) to pid: 0
Apr 30 08:07:58 forbidden kernel: Sig: 18 from pid: 92 (bash) to pid: -94
Apr 30 08:10:23 forbidden kernel: Sig: 19 from pid: 96 (vi) to pid: 0
Apr 30 08:10:26 forbidden kernel: Sig: 18 from pid: 65 (bash) to pid: -96
Apr 30 08:33:07 forbidden kernel: Sig: 19 from pid: 503 (vi) to pid: 0
Apr 30 08:33:41 forbidden kernel: Sig: 18 from pid: 65 (bash) to pid: -503
Apr 30 08:41:08 forbidden kernel: Sig: 19 from pid: 711 (vi) to pid: 0
Apr 30 08:41:17 forbidden kernel: Sig: 18 from pid: 65 (bash) to pid: -711
Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -43
Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -730
Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -45
Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -46
Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -47
Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -48
Apr 30 08:43:18 forbidden kernel: Sig: 15 from pid: 30 (syslogd) to pid: 29 (syslogd)
Apr 30 08:44:34 forbidden kernel: Sig: 15 from pid: 44 (bash) to pid: 63 (bigloop)
Apr 30 08:46:11 forbidden kernel: Sig: 4 from pid: 44 (bash) to pid: 80 (bigloop)
Apr 30 08:47:20 forbidden kernel: Sig: 4 from pid: 44 (bash) to pid: 84 (bigloop)
The End
So did we get the SIGABRT traced down? Sort of. It turns out the
signal 6 was self-inflicted by a signal handler.
What signal handler was that? The signal handler for signal 11 (SEGV).
Now where is this signal 11 coming from? What does this logfile show?
Nothing. This kernel patch just logs signals sent via the kill
system call so we can only see signals sent by user processes. The
signal 11 is occurring in the kernel somewhere. It's most likely a
bad pointer.
Personally I've seen sig 11's pop up from three causes:
- Accessing unmapped memory. (ie a bad pointer)
- Running out of stack space. (8 Mega Bytes by default)
- Flaky hardware.
- Okay four causes:
Intentionally sending a signal 11 with kill().
I bumped the stack space up to 32 Megs but the application still crashes
and I don't have any problems with other programs crashing like this on
that busy machine. It looks like the cause is problems with the server
program.
The vendor claimed that upgrading to the enterprise solution would keep
the problems at bay. It turns out the (aprox. $4000) "upgrade" was a
new serial number, and NOT a new executable, which enabled advanced
features in the software but the segfaults are still a problem.