Linux UserSpace Signal Logger : SuperFrink.Net

Linux UserSpace Signal Logging

(Logs signals sent from userspace programs.)

by Chad Clark (Version 10 March 2003)

The Beginning

I was dealing with a program that was dying from signal 6 (SIGABRT). I had to stop the server from restarting every 3 to 10 minutes and wanted to trace where the signals were coming from. A quick kernel patch and things were on the way to being fixed, so I thought.

The Idea

What I did was log [ via printk() ] when a signal was sent by a user process. This was accomplished by adding a printk() to sys_kill() [ see kill in section two of the man pages ]. I wanted to log the process id and name of both the sending and receiving processes. Sometimes the receiver won't have a name because it is a process group or job so that had to be taken care of as well.

In order to keep the logging specific and flexible I added a table to store the list of signals which were to be logged [ logged_signals ]. This is a simple array indexed by signal number where a non-zero value means for the signal to be logged. Then I added a system call so I could change the signals that are logged.

The Code

/usr/src/linux/kernel/signal.c was changed a bit to hold the table and sys_kill was patched. Also I stashed the new system call in there. The new code can be found HERE.
/usr/src/linux/include/asm-i386/unistd.h had one line added to it. HERE. This just gives the new system call a number.
/usr/src/linux/arch/i386/kernel/entry.S also got a new line. See it HERE. This just adds a reference to the system call.

The System Call

sys_siglog takes two arguments: a signal number and an action. The action is either a zero to disable logging of that signal or a non-zero value to enable logging of the passed in signal.

Testing

Step 1: Setup syslogd.
Edit /etc/syslog.conf to send *.debug to /var/log/debug. This just takes a line like:
*.debug /var/log/debug
Step 2: The test program.
siglog.c is a quick and dirty C program that takes two arguments on the command line. These arguments are just turned into int's and then passed to the system call. There is no error checking done on the arguments but sys_siglog isn't to picky. Note you do have to be root to change which signals are logged.

Compile the test program with gcc siglog.c -o siglog
Step 3: Testing.
Assuming a make dep && make bzImage ; echo -e "\a" and that the new kernel boots and such then fire up a sleep 100 on one virtual terminal and signal with kill -4 from a second tty. /var/log/debug should have an entry like:
Apr 30 06:04:50 forbidden kernel: Sig: 4 from pid: 44 (bash) to pid: 61 (sleep)

A kill -1 66 (on a new sleep :) should generate something like:
Apr 30 06:05:42 forbidden kernel: Sig: 1 from pid: 44 (bash) to pid: 66 (sleep)

(By the way "forbidden" is the machine's hostname.)

Now mask out signal 4 via ./siglog 4 0 .

Then try kill -4 PID again and check the logfile once more. You should not see an entry.

Lastly try kill -1 PID once more. This time you should again see a log entry.

Now turn signal 4 logging back on via ./siglog 4 1 and check to see that both signals 1 and 4 are logging entries.

The Output

Some typical entries including suspending a process and booting up will look like:

  Apr 30 07:56:05 forbidden kernel: Sig: 15 from pid: 44 (bash) to pid: 61 (a.out)
  Apr 30 07:56:22 forbidden kernel: Sig: 18 from pid: 44 (bash) to pid: -62
  Apr 30 07:57:08 forbidden kernel: Sig: 11 from pid: 44 (bash) to pid: -62
  Apr 30 07:57:17 forbidden kernel: Sig: 15 from pid: 44 (bash) to pid: 63
  Apr 30 07:57:20 forbidden kernel: Sig: 15 from pid: 44 (bash) to pid: 62 (sleep)
  Apr 30 07:57:24 forbidden kernel: Sig: 18 from pid: 44 (bash) to pid: -62
  Apr 30 08:00:36 forbidden kernel: Sig: 15 from pid: 88 (poff) to pid: 74 (pppd)
  Apr 30 08:07:29 forbidden kernel: Sig: 19 from pid: 94 (vi) to pid: 0
  Apr 30 08:07:58 forbidden kernel: Sig: 18 from pid: 92 (bash) to pid: -94
  Apr 30 08:10:23 forbidden kernel: Sig: 19 from pid: 96 (vi) to pid: 0
  Apr 30 08:10:26 forbidden kernel: Sig: 18 from pid: 65 (bash) to pid: -96
  Apr 30 08:33:07 forbidden kernel: Sig: 19 from pid: 503 (vi) to pid: 0
  Apr 30 08:33:41 forbidden kernel: Sig: 18 from pid: 65 (bash) to pid: -503
  Apr 30 08:41:08 forbidden kernel: Sig: 19 from pid: 711 (vi) to pid: 0
  Apr 30 08:41:17 forbidden kernel: Sig: 18 from pid: 65 (bash) to pid: -711
  Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -43
  Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -730
  Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -45
  Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -46
  Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -47
  Apr 30 08:42:27 forbidden kernel: Sig: 15 from pid: 1 (init) to pid: -48
  Apr 30 08:43:18 forbidden kernel: Sig: 15 from pid: 30 (syslogd) to pid: 29 (syslogd)
  Apr 30 08:44:34 forbidden kernel: Sig: 15 from pid: 44 (bash) to pid: 63 (bigloop)
  Apr 30 08:46:11 forbidden kernel: Sig: 4 from pid: 44 (bash) to pid: 80 (bigloop)
  Apr 30 08:47:20 forbidden kernel: Sig: 4 from pid: 44 (bash) to pid: 84 (bigloop)

The End

So did we get the SIGABRT traced down? Sort of. It turns out the signal 6 was self-inflicted by a signal handler.

What signal handler was that? The signal handler for signal 11 (SEGV).

Now where is this signal 11 coming from? What does this logfile show? Nothing. This kernel patch just logs signals sent via the kill system call so we can only see signals sent by user processes. The signal 11 is occurring in the kernel somewhere. It's most likely a bad pointer.

Personally I've seen sig 11's pop up from three causes:

Accessing unmapped memory. (ie a bad pointer)
Running out of stack space. (8 Mega Bytes by default)
Flaky hardware.
Okay four causes: Intentionally sending a signal 11 with kill().

I bumped the stack space up to 32 Megs but the application still crashes and I don't have any problems with other programs crashing like this on that busy machine. It looks like the cause is problems with the server program.

The vendor claimed that upgrading to the enterprise solution would keep the problems at bay. It turns out the (aprox. $4000) "upgrade" was a new serial number, and NOT a new executable, which enabled advanced features in the software but the segfaults are still a problem.