Vous êtes sur la page 1sur 17

The Klogger HOWTO

Yoav Etsion Tal Ben-Nun Dan Tsafrir Dror G. Feitelson

School of Computer Science and Engineering The Hebrew University, 91904 Jerusalem, Israel {etsman,talbn,dants,feit}@cs.huji.ac.il

1 2 3 What is Klogger? How does Klogger Work? Klogger Schemata 3.1 Klogger Conguration File . . . . . . . . . . . . . . . . . . 3.1.1 Conguration File Naming . . . . . . . . . . . . . . 3.1.2 Declaring Events . . . . . . . . . . . . . . . . . . . 3.1.3 Declaring Enumerations . . . . . . . . . . . . . . . 3.1.4 Event Inheritance . . . . . . . . . . . . . . . . . . . 3.2 Logging Events . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Using Hardware Performance Counters . . . . . . . . . . . 3.4 Binding Hardware Performance Counters . . . . . . . . . . 3.5 Reading Hardware Performance Counters . . . . . . . . . . 3.6 Adding Support for Other PentiumIV Performance Counters Compiling a Klogger Enabled Kernel Using Klogger at Runtime 5.1 Enable/Disable Logging . . . . 5.2 Binary to Text Conversion . . . 5.3 Enable/Disable Specic Events . 5.4 Default Events . . . . . . . . . . 5.5 Buffer Size and Low Water Mark 5.6 Event Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 3 3 3 3 4 4 5 5 6 7 8 8 9 9 9 9 10 10 11 11 11 12 12 13 13 13 14 17 17 1

4 5

Klogger Perl Module 6.1 Control Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Log Analysis Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Current Schemata 7.1 Stopwatch Schema . 7.2 Scheduler Schema . . 7.3 Locking Schema . . 7.4 Networking Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

8 9

Design and Implementation Future Work

1 What is Klogger?
Klogger is a framework for post mortem analysis of the Linux kernel behavior. It is designed so that developers can easily dene events they want logged and log them in the kernel code. At runtime, the Klogger infrastructure is enabled for the period of time the developer wishes to analyze, logging all events into a special log le, which can later be analyzed. Klogger excels at its low overhead and exibility, allowing researchers to analyze performance bottlenecks that are very difcult to reach with standard logging tools, while not requiring the user to become a kernel hacker. This document is an attempt to allow the research community to gain experience with Klogger, while possibly providing feedback about the tool and its features. Klogger exibility is embodied in its support for different schemata different disjoint event sets, dedicated to the analysis of different subsystems. Schemata enable users to analyze only the subsystem that interests them, and create new Klogger schemata specic for their research (read more about schemata in sections 3 and 7). We are trying to collect a host of schemata which will allow both developers and researchers to study the various kernel subsystems without having to dive into the actual implementations. The best example is a storage systems researcher who does not have to fully understand the implementation of the subsystem and driver in one kernel or another, as long as he can get information about the blocks read from/written to storage, their order, timings etc. By supplying a schema logging the major events of the storage subsystem, we can allow more people to be involved in cutting edge research. The same goes for most parts of modern kernels networking, scheduling, synchronization, le systems, etc.

2 How does Klogger Work?

Klogger is logging framework, and as such, needs some efcient way to save runtime events onto persistent storage. This section will try to overview the basic mechanism used by Klogger. (dening events and logging them is described in Section 3). Klogger employs per CPU memory resident buffers, to which logging actually takes place. The logging code rst checks if Klogger is enabled. If so, space is allocated on the relevant CPUs buffer, and the events payload is saved there in the form of a C structure. The buffer is congured to not only be memory resident, but also uses Intels write-combining cache policy [5], allowing efcient sequential writes and preventing the data from polluting the caches. Each CPU is also attached with a special thread in charge of dumping the memory buffer to disk whenever the latter lls. These threads run at the highest real-time priority in the system, under the SCHED_FIFO scheduling policy [4, 1]. Klogger does not actually wait for the buffer to ll, but rather has a user settable low-water mark indicating what percent of the buffer should be used as a reserve. Whenever the low-water mark is reached, Klogger signals the Linux scheduler to run the appropriate CPUs thread, in order to empty the buffer. When using Klogger to analyze a subsystem that can be affected by the memory dump operation, events caused by Kloggers own action might cause the buffer to ll up, thus causing the loss of other events. Such cases can be identied by gaps in the logged events serial numbers. Another possible problem is if logging occurs too often the overhead caused by emptying the buffer might affect the measurement. For such cases, Klogger provides the user with an interface to increase the buffer size (defaulting at 4MB per CPU) and change the low-water mark. This interface is described in Section 5.5. Experience shows that increasing the buffer is sufcient even in the more extreme cases and allows for logging millions of events per second. One warning though: albeit a very efcient framework, we must strictly advise against using Klogger in production systems. Klogger is designed to be a framework used by researchers and developers. As such, some guidelines that are usually required by subsystem designers were relaxed for example: the decision to allow some minimal (and even negligible) overhead even when logging is disabled. A more comprehensive discussion of Kloggers design and implementation is available in Section 8 and in [3].

3 Klogger Schemata
A Klogger schema denes a specic facet of Klogger. One way of looking at it is that the Klogger framework supplies the mechanism how to log while the schemata supply the policy what to log. Technically, when we refer to a Klogger schema we refer two les: a conguration le dening the events to be logged and their payload, with the other le being a kernel patch containing the invocations of klogger where are the events dened in the rst le actually get logged. The next two sections will describe how to dene and log events. The rst will describe the syntax of the Klogger conguration le, and the second will describe how to log events.

3.1 Klogger Conguration File

Describing the syntax of the conguration le is split into three parts: 1. Conguration le naming (Section 3.1.1). 2. Dening Klogger events (Section 3.1.2). 3. Using the optional hardware performance counters (Section 3.3). To clear up the picture, we will accompany those sections with a simple example: an event named SCHEDOUT which indicates that the CPU scheduler has switched out the running process, logging its pid. 3.1.1 Conguration File Naming Kloggers conguration les reside in the kernel source root and are named .klogger.conf.NAME where NAME is the name of the schema they dene. The separation into multiple conguration les allows great exibility when adding or removing logging schemata. All Klogger enabled kernels must have at least one conguration le, named .klogger.conf.base which is included alongside the main Klogger kernel patch and contains denitions for the build in Klogger events, such as when it was enabled/disabled, what processes are running when its enabled, etc. This le can serve as a basic example of a Klogger conguration le. 3.1.2 Declaring Events Events are declared in a syntax similar to C structs. The basic template is: event EVENTNAME { typeA varname1; typeB varname2; ... typeN varnameX; } with the various types known to Klogger listed in Table 1. Note that the usage of strings is highly discouraged since it requires the use of buffer copy functions, dramatically increasing the logging overhead (though strings up to the size of the underlying hardwares cache line are optimized into simple long assignments). In our SCHEDOUT example, the event will thus be dened as: event SCHEDOUT { int pid } and will appear in the nal text log as:

Klogger Type short ushort int long longlong ushort uint ulong ulonglong stringN

C Type short unsigned short int long long long unsigned short unsigned int unsigned long unsigned long long char[N]

Table 1: The basic types available in a Klogger event, and their C equivalents. { header => { "type" "serial" "timestamp" }, "pid" }, Though extracting the text based log is described in Section 5.2, for now simply note that each event appears as a Perl hash. The hash contains all the elds described in the event denition and a header containing the event type (as text), its serial number in the log and a timestamp taken from the hardwares cycle counter. 3.1.3 Declaring Enumerations Enumerations are declared in a syntax similar to C enums. The basic template is: enum enumerationtype { one, minusone = -1, four = 4, TWO = 2, Three } The type enumerationtype may be later used in an event as a type. Integer values will automatically be converted to their equivalent strings when decoding Kloggers output. 3.1.4 Event Inheritance Event inheritance is an advanced feature of Klogger, enabling analysts the denitions of inclusive groups of events, thereby dening different log levels. The inheritance model, allows schema authors to specify the levels of the various events logged by Klogger and include log levels in other log levels using the include command. Similar to C++ syntax, the currently used log level will be declared before all log level declarations with the using command. Furthermore, when a schema author would like to omit an event from an included log level, he may use the omit command. Such action could be done in order to lter out information unneeded in a specic log level, thereby reducing the information generated by Klogger. 4 => "SCHEDOUT", => "119", => "1032071755760", => "1073",

Event inheritance syntax is as follows: using two loglevel one < event ONE_DO { ... } event ONE_SIMPLE { ... } > loglevel two < include one omit ONE_SIMPLE event TWO_DETAILED { ... } > A more detailed example of log levels can be found in the network schemas conguration le.

3.2 Logging Events

Logging events is performed using a special C macro called klogger. First, well need to include the main Klogger header le: #include <linux/klogger.h> The syntax for the klogger macro resembles that of printf, with the rst argument being the event name, followed by values for each of the event elds as ordered in the event description. The call for our general event template will thus be: klogger(EVENTNAME, varname1, varname2, ... , varnameX); This call is converted by the C preprocessor into an inlined event specic logging function. However, If the kernel is compiled without Klogger support, this call converts into a null expression by the preprocessor. Sometimes special code is needed (like using temporary variables) for non-trivial logging. Such code should be protected using an ifdef statement checking the existence of the CONFIG_KLOGGER macro, so it can be eliminated if the Klogger is not congured into the kernel. Returning to our SCHEDOUT example, the call logging the event will simply be: klogger(SCHEDOUT, task->pid); assuming task is a variable of type struct task_struct* pointing to the process that was just preempted off the CPU.

3.3 Using Hardware Performance Counters

Most modern processors support hardware performance counters, measuring various aspects of the processors operations. These counters can provide information varying from the number of cache misses for each cache level,

Counter Name dtlb_miss dtlb_miss_os dtlb_miss_user itlb_miss itlb_miss_os itlb_miss_user l1_cache_misses l1_cache_misses_os l1_cache_misses_user l2_cache_misses l2_cache_misses_os l2_cache_misses_user

Counter Description Page walks for a data TLB miss, all protection levels. Page walks for a data TLB miss, OS protection level. Page walks for a data TLB miss, user protection level. Page walks for an instruction TLB miss, all protection levels. Page walks for an instruction TLB miss, OS protection level. Page walks for an instruction TLB miss, user protection level. L1 cache misses, all protection levels. L1 cache misses, OS protection level. L1 cache misses, user protection level. L2 cache misses, all protection levels. L1 cache misses, OS protection level. L1 cache misses, user protection level.

Table 2: Hardware performance counters currently supported for the PentiumIV architecture. For full descriptions and the list of all counters supported by the hardware please refer to [5]. through the number of retired instructions, and sometimes even the number of memory operations queued on the processor. Needless to say, these counters are processor dependant. Moreover, these counters may not even be backward compatible with previous versions of the same architecture. To make Klogger as efcient as possible, all pieces of code accessing the hardware performance counters are inlined. The result is that all abstractions used are taking place at compile time, incurring no runtime overhead from abstract indirections (as oppose to the approach taken by other tools, such as PAPI [2]). In that context, user should be warned that since the counters are processor model dependant, code compiled with performance counter support for one model may not run on another! Simply put, kernels compiled to utilize PentiumIIIs performance counters might not even boot on a PentiumIV machine.

3.4 Binding Hardware Performance Counters

In order to use the hardware counters, we must add an architecture section to the conguration le, specifying the underlying hardware and binding Kloggers virtual counters to specic events supported by the hardware. The basic structure is arch ARCH_NAME { counter1 HARDWARE_COUNTER_NAME_A; counter2 HARDWARE_COUNTER_NAME_B; ... counterN HARDWARE_COUNTER_NAME_X; } where ARCH_NAME is the architecture name (i.e. PentiumPro, PentiumIII, PentiumIV, etc.), 1 counter1...N are the virtual counter names, and HARDWARE_COUNTER_NAME_A...X are the hardware specic counter names. The list and descriptions of the hardware specic counter names supported for the given architecture can be viewed by running the command ./scripts/klogger-autogen.pl --list-counters from the Kloggers kernel source root. For example, the counters currently supported for the PentiumIV architecture are listed in table 2.
1 Embarrassingly, the only architecture currently supported is the PentiumIV. However, adding new architectures is not a very difcult task, but requires a thorough reading of the relevant chapter in [5]. We would be glad to assist anyone willing to undertake such a task...

Lets return to our SCHEDOUT example for a moment. Imagine we want Klogger to log not only the preempted processs pid, but also to log the number of overall L2 cache misses it has suffered. Since we only want the event occurring when the process was running (regardless of any interrupts handled by the kernel during that time), we need to bind the rst virtual counter to the underlying hardwares counter of the L2 cache misses occurring at the user protection level the l2_cache_misses_user counter. As such, the architecture dependant section in the conguration le would look like this: arch PentiumIV { counter1 l2_cache_misses_user; } Also, we would add another 64bit integral eld to the SCHEDOUT event, saving the counters value at the moment of preemption: event SCHEDOUT { int pid ulonglong l2_cache_misses_user } which means the textual log will hold the event as { header => { "type" "serial" "timestamp" }, "pid" "l2_cache_misses_user" }, The next section describes how to read the hardwares performance counters, and how to log our extended SCHEDOUT event. => "SCHEDOUT", => "119", => "1032071755760", => "1073", => "35678014",

3.5 Reading Hardware Performance Counters

Reading the performance counters is done by two types of functions: The rst type is generic, accepting which counter to read as an argument. The second type is logical, with the event name being part of the functions name. This separation allows the programmer two different modus operandi: 1. Read what ever is in a specic counter to generate log, allowing changes to the conguration le to change the counters type, with no need to modify the code itself. 2. Require that a logical event be dened in some counter, when the logical event has specic importance. Both types are implemented using compile time generated inlined functions. The reasoning behind this implementation is that while it would be simpler to just generate a logical to physical counter map and have a function use that map, such method involves more overhead and cache pollution. The rst, counter-based interface is dened as: unsigned long klogger_get_counter(unsigned long ctr); unsigned long long klogger_get_counter_ll(unsigned long ctr); where the rst function returns just the lower 32 bits of the 64 bit counter, and the second function returns the entire 64 bit number. The ctr argument is a macro, named KLOGGER_COUNTERN where N is the counter number i.e. KLOGGER_COUNTER1, KLOGGER_COUNTER4, etc. The second function type is declared as: 7

unsigned long unsigned long

klogger_get_EVENTNAME() klogger_get_EVENTNAME_ll()

where EVENTNAME is the actual name used in the conguration le. Returning once again to our SCHEDOUT example, we can read the counter we dened using either function type. We bound logical counter 1 to the l2_cache_misses_user event. Thus, using the rst function type, logging would take place using the following line: klogger(SCHEDOUT, task->pid, klogger_get_counter(KLOGGER_COUNTER1); If we want to use the function based on the event names, we just use the following line: klogger(SCHEDOUT, task->pid, klogger_get_l2_cache_misses_user()); Sometimes the logging is not completely straightforward, involving some code preparing the data to be logged. In such cases, it may be important to know for which architecture the kernel was compiled. For that reason we use the KLOGGER_ARCH_COMPILED macro which is set to one of the klogger_arch_t enumeration values, dened in include/asm/klogger.h.

3.6 Adding Support for Other PentiumIV Performance Counters

The PentiumIV architectures performance counters support many different events, only few of which we support (Table 2). Adding support for other counters is not difcult, but rather time consuming. As such, we have decided to take the lazy road each programmer is welcome to add support for the events she needs. Adding support is very simple. It only involves copying bit conguration from the long list of events listed in Appendix A of [5] (available online at http://developer.intel.com/design/Pentium4/documentation.htm) into the Perl hash named %EVENTS found in KERNEL_ROOT/scripts/klogger/PentiumIV.pm (the module implementing the hardware counters for the PentiumIV architecture). The copying is straight forward as each entry in the hash uses similar naming as that in the architectures manual, with only two new elds: the event name, and its description. Please send us any new hash entries you add so it can be added to the main distribution.

4 Compiling a Klogger Enabled Kernel

For simplicity, let us assume you have opened the Klogger distribution into a directory called KLOGGER. Adding Klogger support is a simple process: 1. Patching klogger Add the Klogger framework to the kernel using the base patch, and copy the base conguration le to the kernel root:

foo[ linux-2.6.9 ] patch -p1 < KLOGGER/klogger-2.6.9/patch/klogger-2.6.9.patch foo[ linux-2.6.9 ] cp KLOGGER/klogger-2.6.9/patch/klogger.conf.base .klogger.conf.ba 2. Adding schemata Now you can either create your own schema, or use a ready made one supplied with the distribution (for example, the scheduler schema): foo[ linux-2.6.9 ] patch -p1 < \ KLOGGER/klogger-2.6.9/schemata/sched/klogger-scheduler-schema-2.6.9.patch foo[ linux-2.6.9 ] cp KLOGGER/klogger-2.6.9/schemata/sched/klogger.conf.sched \ .klogger.conf.sched

3. Enable KLogger in the kernel conguration Enable [Kernel hacking > KLogger] in the kernel conguration menus. A second option, [Kernel hacking > KLogger benchmarks] also enables runtime measurements of each events logging overhead (more on benchmarks in Section 5.6). Now compile the kernel, and log like the wind...

5 Using Klogger at Runtime

Klogger offers a control interface through the /proc lesystem (this is actually sysctl options exported through the /proc lesystem). All the control les reside under /proc/sys/klogger. The convention is that event names are in upper case, so all les whose name consists of upper case letters represent events, while lower-cased les represent other conguration options. Note that Kloggers setting cannot change while it is enabled!

5.1 Enable/Disable Logging

Kloggers main control switch toggling logging on or off is done via the /proc/sys/klogger/enabled le. By writing the text string 1 or 0 into that le, you can enable or disable logging, respectively. Logs are saved in binary form into per-CPU les residing under the /tmp directory. These dump les are named /tmp/kernlog.dump.cpuN where N is the CPU number. The dump les might get very big, depending on the rate by which events are logged as such, make sure you have adequate space on the lesystem inside which /tmp resides. Note of caution for SMP users: inter-CPU synchronization is based on the Linux kernel synchronization, and we have not tested its accuracy, nor the drift between CPU clocks. While the assumption that the drift is very small due to the fact that different CPUs on the same machine experience the same operating conditions (temprature/power), the same does not hold for the accuracy of the mechanism resetting all CPU clocks during Linux boot time. This is denitely on our TODO list.

5.2 Binary to Text Conversion

When building Klogger with a specic set of schemata, a decoding utility is created for that specic kernel able to decode the binary log into a more human friendly, text formatted le. As mentioned earlier, the text format is basically an array of Perl Hashes, each represeting an event. Assuming your Klogger kernel sources reside under KERN_ROOT, the decoder is compiled into KERN_ROOT/kdecoder/klogger_decoder This tools operation is very simple: its only command line argument is the binary log, whose textual conversion is printed into the standard output. As such, the usual command line would look like: KERN_ROOT/kdecoder/klogger_decoder /tmp/kernlog.dump.cpu0 > kernlog.cpu0.txt

5.3 Enable/Disable Specic Events

Sometimes a Klogger is compiled with a broad set of events, but during certain measurement one would like to log only a subset of these. As such, Klogger is equipped with a simple mechanism to block logging of specic events. Under the /proc/sys/klogger directory, you can nd a special le named after each of the events compiled in. This special le is an event specic toggle switch. By writing the strings 0 or 1you can block or unblock logging of the specic event, respectively. Note that the overhead imposed by an events logging code when it is disabled either specically, or because Klogger is disabled entirely is similar.

5.4 Default Events

Klogger has several base events, used to account for invocation times, buffer dump times, and the state of the system when the logging starts i.e. all the processes in the system. These events are: 1. KLOGGER_START Logged whenever Klogger is enabled (the rst event logged), saving the kernel version and the size of the memory resident buffer used per CPU. 2. KLOGGER_FINISH The reciprocal of the previous event. Simply marks when Klogger was stopped. 3. PROCESS_EXISTS When Klogger is enabled, all existing processes are logged, being the mainline of the systems status. This event is logged for each existing process, holding the processs general information: pid, scheduling priority, scheduling policy, and its command line. Admittedly, this event is pretty expensive overhead-wise, it is only logged once as part of the logs header. 4. LOGGING_EVENT A complement to the system status is Kloggers status, mainly the events that are being logged. Again, a LOGGING_EVENT is saved for each event name Klogger is congured to log. 5. DUMP_BEGIN This event is logged whenever Klogger dumps its memory resident buffer to the disk. Its main purpose is that post mortem analysis of the log can isolate event implicitly generated by Klogger itself. For example, when analyzing disk trafc, one would like to discard trafc generated by the measurement itself. This event also saved the dump index the index of the dump during the current logging session. 6. DUMP_FINISH Again, a reciprocal of the former, indicating the end of the buffer dump. Its payload also includes the number of bytes actually dumped. 7. MARK A special event being used to mark places in the log for example, a measurement can have several phases, whose transitions the user would like mark in the log. Generating this event is done by writing some number string into a special le: /proc/sys/klogger/mark. Doing this will cause the event to be logged, along with the logging processs pid and the number written into the le. The written number can be used by the user to separate the different MARK events.

5.5 Buffer Size and Low Water Mark

Changing the buffer size is done via the /proc/sys/klogger/buffer_size le by simply writing a new size in units of MB (the size refers to the buffer of each CPU, so the actual memory allocated is P size, where P is the number of processors in the system). Reading the le shows the buffer size allocated for each CPU. For example, if we want to change the buffer size to 16MB per CPU right after boot the command set would be: foo[ /proc/sys/klogger ] cat buffer_size 4MB/CPU foo[ /proc/sys/klogger ] echo 16 > buffer_size foo[ /proc/sys/klogger ] cat buffer_size 16MB/CPU (note that although writing to the le is done with only numbers, when reading the le the units are appended to the data).


To avoid stressing the memory by allocating large, physically continuous buffers Klogger uses virtually continuous buffers. However, there is a limit on the kernels virtual memory size, which is set to 128MB on Linux 2.6.9 kernels. If the total memory allocated exceeds this limit, you can boot the machine with the special vmalloc boot time ag, resetting the kernels virtual memory limit. For example, when using 128MB buffers on a 4-way machine, we added the vmalloc=640MB boot parameter so the kernel will have enough virtual memory for both the 128 4 = 512M B needed by Klogger, alongside the original virtual memory space it uses. Resetting the low-water mark is done in a similar manner, using the /proc/sys/klogger/lowwater le. The number written to that le is the percent of the memory buffer space below which the buffer will be emptied (default is 10%). For example: foo[ 10% foo[ foo[ 5% foo[ /proc/sys/klogger ] cat lowwater /proc/sys/klogger ] echo 5 > lowwater /proc/sys/klogger ] cat lowwater /proc/sys/klogger ]

5.6 Event Benchmarks

If the kernel was congured with Kloggers internal benchmarking mechanism (kernel option [Kernel hacking > KLogger benchmarks] enabled), you will nd a directory called /proc/sys/klogger/benchmarks, in which there is a le matching each events name. The les contents is the events logging overhead, in CPU cycles. When a le is read, the matching event is logged several times to measure the average overhead. The number of iterations can be tuned using the special /proc/sys/klogger/benchmarks/iterations le (default is 1000 iterations). For example, on our PentiumIV machine clocked at 2.8GHz the overhead of the KLOGGER_START event is 234 cycles (83.6 nanoseconds!), while the overhead of the more complex PROCESS_EXISTS (which contain character strings) is 16614 cycles (5.93us).

6 Klogger Perl Module

Perl is a scripting language providing rapid development for both text analysis and execution of specic workloads. As such, Klogger includes a Perl modules to simplify both logging of workloads and the analysis of the logs. When examining the effects of specic workloads on kernel behavior using Klogger, Perl can be used to automatically generate these workloads by running sequences of the needed applications. Afterwards, Perl can serve as a great tool to analyze log les, generate statistics and look for kernel behavioral patterns expressed by the various events. In order for Perl to load the module, add the directory in which the module resides to the PERL5LIB environment variable.

6.1 Control Functions

The rst group can be regarded as Klogger control functions, used in automated workload generating scripts. These are just wrappers for the basic shell commands introduces so far: klogger_start_logging() As the same says: starts logging! (an equivalent to writing 1 into /proc/sys/klogger/enabled). klogger_stop_logging() Another surprise: the function simply stops the logging! (equivalent to writing 1 into /proc/sys/klogger/enabled). klogger_dump2txt(kernel_home, dumple, txtle) Convert a binary dump le into the textual log format. The arguments are: kernel_home: the path to Kloggers kernel source tree. Needed for the binary2text decoder. 11

Algorithm 1 The basic skeleton of a Klogger analysis script. #!/usr/bin/perl -w use klogger; # Open the log file my $logh = klogger_open_log("/some/log/file.txt"); # While events keep coming, process them. for(my $event = klogger_next_event($logh); defined($event); $event = klogger_next_event($logh)) { Process the event... } # close the log klogger_close_log($logh);

dumple: the binary le to be converted. txtle: the output text le.

6.2 Log Analysis Functions

The second group of functions is used for log analysis: klogger_open_log(text_log_lename) Opens the le text_log_lename for analysis. Returns a handle to the open log, or the undened value if the open failed (the handle can only be used with the following functions). klogger_next_event(log_handle) Given a handle to an open log, return the next event in the log. The return value is a Perl reference to a Hash containing the next event in the log (as depicted in the textual le format). When reaching the end of the log the undened value is returned. klogger_get_processes(log_handle) Returns a reference to a Hash containing the data about the running processes, indexed by the processes pids. klogger_close_log(log_handle) Close the open log associated with the given handle. Using the log analysis functions, analysis scripts can be built around the code depicted in Algorithm 1.

7 Current Schemata
As previously described, a major goal when designing Klogger was to separate the mechanism from the policy. While the core of the framework is regarded as the mechanism, the schemata are the available policies. In this section we describe the schemata that are supplied with Klogger itself. We hope users will develop more schemata, sharing them with us and the community, so that future versions of Klogger will deliver a variety of subsystems schemata.


7.1 Stopwatch Schema

The stopwatch schema is brought here mainly to serve an example. As its name suggests, it is simply a stopwatch: it has two events, EVENT_START and EVENT_STOP, marking the beginning and end of some events, respectively. Since it is a generic stopwatch, the schema does not even has a matching kernel patch, simply because it is aimed a measuring anything just place the klogger calls before and after the event you wish to measure. Each events has an id eld, to match a start event to its adjoining stop event, and to separate different events from each other. By checking the timestamp in the headers of both events, you can measure the cycles the event in question took. To measure events, simply use the following template in your kernel code: klogger(EVENT_START, user_chosen_event_id); /* The event in question... */ klogger(EVENT_STOP, user_chosen_event_id); Use this basic schema as a template, adding more data to the event payload to get a feeling of kloggers operation.

7.2 Scheduler Schema

The scheduler schema consists 8 events depicting all process states (to view the exact payload of each process look at the schemas conguration le): 1. TRY_TO_WAKEUP A process was supposed to have been woken up. The TRY prex emphasize that the process might have already been awake (for example, a running process have been sent a signal). 2. REMOVE_FROM_RUNQ A process has became non-runnable (exited, blocked on I/O) and has been removed from the run queue. 3. ADD_TO_RUNQ A process has became runnable and was added to the run queue. 4. SCHEDOUT A process was preempted as part of a context switch. 5. SCHEDIN A process was allocated a CPU as part of a context switch. 6. FORK The fork system call was called. 7. EXEC The exec system call was called. 8. EXIT A process exited.

7.3 Locking Schema

This schema tries to catch the essence of Linuxs locking mechanism the subsystem used to achieve inter-processor mutual exclusion. Linux uses the busy-wait approach for locking, and three major lock type: 1. spinlock: only allow one processor to execute a critical section at any given time. 2. read/write spinlock: asymmetric locks, which allow multiple read-only contexts to execute simultaneously, while writers contexts get exclusive access. 13

3. Big Kernel Lock (BKL), a relic from early SMP support in Linux, which was meant to be a transitional solution from the monolithic support (only one CPU running kernel code at any given time) to the ne grained support (separate locks for each global data structure). Although the BKL has been deemed a deprecated feature, it is still widely used in some parts of the kernel. A more detailed discussion of locking in Linux can be found in [1] (and a general discussion in [6]). The events included in this schema are: 1. SPINLOCK_INIT A spinlock was initialized. 2. SPINLOCK_FINISH A spinlock was released. This event also saves the time in which the lock was acquired. 3. PREEMPT_SPINLOCK_FINISH A spinlock was released on kernel preemption. This event also saves the time in which the lock was acquired. 4. RWLOCK_INIT A read/write lock was initialized. 5. READLOCK_FINISH A spinlock was released from reader context. This event also saves the time in which the lock was acquired. 6. WRITELOCK_FINISH A spinlock was released from writer context. This event also saves the time in which the lock was acquired. 7. PREEMPT_WRITELOCK_FINISH A spinlock was released from writer context on kernel preemption. This event also saves the time in which the lock was acquired. 8. BKL_LOCK_FINISH The BKL was released (again, also saving the in which time it was acquired).

7.4 Networking Schema

The networking schema enables the user to run networking based tests and analyze the data being transferred through the layers. Currently supported layers are the socket layer, the protocol(TCP/UDP/ICMP) layer and the IP layer. The schema consists of 6 enumerations and 38 events, separated to log levels using Kloggers event inheritance system, the top level being the highest layer. That way, an analyst may choose the amount of logging he requires of Klogger and receive the wanted results in compile time. No modications to kernel code are required. Currently, we only support the IPv4 protocol stack. Socket level: 1. SOCKET_CREATE A socket has been created due to a request. The sockets new address in memory is logged (and could be used to track socket activities) and its domain and protocol. 2. SOCKET_CLOSE A socket has been closed and will not be active anymore. 3. SOCKET_ASSIGNADDR A socket has been assigned to an IP address and port and was set to communicate using a specied protocol. 4. SOCKET_LISTEN A socket has been assigned to be a listening socket and will listen for data. 14

5. SOCKET_ACCEPT A socket has been accepted to connect to another socket. 6. SOCKET_WRITE A socket is writing data to its target. 7. SOCKET_RECEIVE A socket is receiving (processed) data from its target. Connection level: 1. SOCKET_BIND (REPLACES SOCKET_ASSIGNADDR) A socket has been bound to an IP address and port and was set to communicate using a specied protocol. 2. SOCKET_CONNECT (REPLACES SOCKET_ASSIGNADDR) A socket has connected to an IP address and port and was set to communicate using a specied protocol. 3. CONNECTION_HANDSHAKE A socket has begun in a protocol connection handshake. 4. CONNECTION_CLOSE A socket has closed its connection with the other socket it was connected to. Protocol level: 1. TCP_SENDSYN A SYN packet has been sent from the specied TCP socket. 2. TCP_RECVSYN A SYN packet has been received to the specied TCP socket. 3. TCP_SENDSYNACK A SYN-ACK packet has been sent from the specied TCP socket. 4. TCP_RECVSYNACK A SYN-ACK packet has been received to the specied TCP socket. 5. TCP_SENDACK An ACK packet has been sent from the specied TCP socket. 6. TCP_RECVACK An ACK packet has been received to the specied TCP socket. 7. TCP_SEND The specied amount of data is being sent from the specied TCP socket. 8. TCP_SENDPACKET A fragmented/whole (unspecied) TCP packet is being queued for sending in the specied TCP socket. 9. TCP_RECEIVE The specied amount of data (processed) is being received to the specied TCP socket. 10. TCP_RECEIVEPACKET A fragmented/whole (specied) TCP packet has been received to the specied TCP socket and is queued for processing. 15

11. TCP_RECV_URGENTDATA The specied amount of data (processed) is being received to the specied TCP socket. The packet carries the URG (urgent) ag and is processed before other non-urgent data. 12. TCP_DISCONNECT A TCP socket has disconnected from its target. 13. UDP_PUSHPENDINGFRAMES A UDP socket is pushing all of its pending frames to a specied socket buffer. 14. UDP_FLUSHPENDINGFRAMES There has been a cork in the specied UDP socket and therefore all of the pending frames are being ushed immediately. 15. UDP_QUEUESOCKETBUFFER A UDP socket is queueing IP packets received from a specied socket buffer for processing. 16. UDP_RECEIVE The specied amount of data (processed) is being received to the specied UDP socket. 17. UDP_SEND The specied amount of data is being sent from the specied UDP socket. 18. UDP_CONNECT A UDP socket has connected to its target using the UDP protocol. 19. UDP_DISCONNECT A UDP socket has disconnected from its target. 20. ICMP_SEND An ICMP packet, containing constant data (specied as enumerated code types), is being sent. 21. ICMP_REPLY An ICMP packet reply packet is being sent back to the ICMP packet sender. 22. ICMP_RECEIVE An ICMP packet, containing constant data (specied as enumerated code types), has been received. 23. ICMP_PUSHPENDINGFRAMES An ICMP socket is pushing all of its pending frames to a specied socket buffer. IP Level: 1. IP_PUSHPENDINGFRAMES Pushed pending frames from a protocol have been received from a specied socket into the specied socket buffer. 2. IP_FLUSHPENDINGFRAMES There has been a cork in the specied socket buffer and therefore all of the pending frames are being ushed immediately and wont be sent. 3. IP_SENDPACKET The specied IP socket buffer has sent a packet. Whether its a fragment or not, its fragment offset, packet size, total size, the requesting PID and the current MTU are all logged. 4. IP_RECEIVEPACKET The specied IP socket buffer has received a packet (unprocessed). Whether its a fragment or not, its fragment offset, packet size, total size, the requesting PID and the current MTU are all logged. NOTE: The total size of the de-fragmented data is not applicable until the last fragment arrives.


8 Design and Implementation

TBD (read the technical report [3] for now)

9 Future Work
1. Complete this document. 2. More schemata 3. Add support for hardware performance counters on a variety of architectures. 4. Make Klogger available at boot time. 5. Let the user change the output log le name. 6. Verify cycle synchronization on SMPs.

[1] D. P. Bovet and M. Cesati. Understanding the Linux Kernel. OReilly & Associates, 2nd edition, 2003. [2] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Supercomputing, Nov 2000. [3] Y. Etsion, D. Tsafrir, S. Kirkpatrick, and D. G. Feitelson. Fine grained kernel logging with klogger: Experience and insights. In 2nd ACM EuroSys, pages 259272, Mar 2007. [4] B. O. Gallmeister. Posix. 4: Programming for the Real World. OReilly & Associates, January 1995. [5] Intel Corp. IA-32 Intel Achitecture Software Developrs Manual. Vol. 3: System Programming Guide. [6] C. Schimmel. UNIX Systems for Modern Architectures. Addison Wesley, 1994.