Arun Thomas<arun.thomas@acm.org>
@arunthomas
BSDCan 2017
James Mickens, The Night Watch:
[A] systems programmer has seen the terrors of the world and understood the intrinsic horror of existence
James Mickens, The Night Watch:
When you debug ... an OS kernel, you do it Texas-style. You gather some mean, stoic people, people who have seen things die, and you get some primitive tools, like a compass and a rucksack and a stick that’s pointed on one end, and you walk into the wilderness and you look for trouble, possibly while using chewing tobacco.
It's like packing military-grade GPS and a lightsaber in your rucksack before heading into the wilderness
And looking for trouble
# dtrace -qn 'syscall:::entry { printf("%s %s\n", execname, probefunc); }'
sshd read
sshd getpid
sshd select
sshd sigprocmask
sshd sigprocmask
sshd write
sshd select
ntpd sigprocmask
ntpd sigreturn
ntpd select
read()
Sizes (syscall Provider)# dtrace -qn 'syscall::read:return /execname == "sshd"/ { @ = quantize(arg0); }'
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@ 23
1 |@@@@@@@@@@ 41
2 | 1
4 |@@@@ 19
8 |@ 4
16 | 2
32 |@ 6
64 |@@ 9
128 |@@@ 12
256 |@@@@@ 20
512 |@@ 9
1024 |@@@ 12
2048 |@ 4
4096 |@ 4
8192 |@ 3
16384 | 0
# dtrace -n 'fbt::malloc:entry /pid != $pid/ { printf("%s %s\n", execname, arg0); }'
dtrace: description 'fbt::malloc:entry ' matched 1 probe
CPU ID FUNCTION:NAME
1 32051 malloc:entry sendmail 44
0 32051 malloc:entry syncer 496
2 32051 malloc:entry ntpd 16
# dtrace -qn 'pid$target:libc.so.7::entry { @[probefunc] = count(); }' -c /bin/ls
...
memset 69
mbrtowc 76
wcrtomb 76
__free 104
free 104
memcpy 119
strncmp 252
proc
Provider (SDT)What happens when we "man man
"?
# dtrace -n 'proc:::exec-success { printf("%s", curpsinfo->pr_psargs); }'
dtrace: description 'proc:::exec-success ' matched 1 probe
CPU ID FUNCTION:NAME
1 14 none:exec-success /bin/sh /usr/bin/man man
1 14 none:exec-success /sbin/sysctl -n hw.machine_arch
2 14 none:exec-success /sbin/sysctl -n hw.machine
2 14 none:exec-success /usr/bin/locale
2 14 none:exec-success /usr/bin/zcat /usr/share/man/man1/man.1.gz
0 14 none:exec-success head -1
2 14 none:exec-success /usr/bin/zcat /usr/share/man/man1/man.1.gz
1 14 none:exec-success mandoc -Tlint -Wunsupp
2 14 none:exec-success /usr/bin/zcat /usr/share/man/man1/man.1.gz
1 14 none:exec-success mandoc
0 14 none:exec-success less
From bmc Wed Sep 3 10:27:51 2003
Subject: Houston, Tranquility Base here...
To: dtrace-interest@kiowa
DTrace has landed. Twenty-three months after we set out, the first cut
of DTrace has integrated into Solaris 10.
DTrace merge stats from Bryan's announcement:
Number of source files in usr/src/uts 1,757
Rank, in lines of code, of dtrace.c among these 12
Rank, in number of assertions, of dtrace.c among these 7
Lines in new .c and .h files created by DTrace project 54,644
Lines of code (lines with trailing semicolon) in same 17,088
Lines of comments in same 6,713
Age, in months, of DTrace project 23
Number of engineers on DTrace project 3
Number of above neither married nor engaged at beginning of project 3
Number of above either married or engaged at end of project 2
Subscribers to dtrace-interest@kiowa 181
Number of people who asked to unsubscribe from dtrace-interest@kiowa 1
write()
)dtrace(1)
)libdtrace
- API for consumers, talks to kernel DTrace via ioctl
ioctl()
to /dev/dtrace/*
D Intermediate Format (DIF)
DT_VAR
) such as execname
# dtrace -S -n 'syscall:freebsd:open:entry {printf("%s %s\n", execname, copyinstr(arg0)); }'
DIFO 0x0x8049c7150 returns string (unknown) by ref (size 256)
OFF OPCODE INSTRUCTION
00: 29011801 ldgs DT_VAR(280), %r1 ! DT_VAR(280) = "execname"
01: 23000001 ret %r1
...
dtrace_probe(probe_id)
somehow (e.g., runtime patching the executable)Bryan Cantrill's Reddit AMA illuminates our path:
dtrace(1)
NAME
dtrace - generic front-end to the DTrace facility
lockstat(1)
NAME
lockstat - report kernel lock and profiling statistics
plockstat(1)
NAME
plockstat - Trace pthread lock statistics using DTrace
sysctl debug.dtrace.verbose_ioctl=1
dtrace_ioctl(411): DTRACEIOC_ENABLE
dtrace_ioctl(584): DTRACEIOC_GO
dtrace_ioctl(388): DTRACEIOC_DOFGET
dtrace_ioctl(388): DTRACEIOC_DOFGET
dtrace_ioctl(778): DTRACEIOC_STATUS
dtrace_ioctl(237): DTRACEIOC_BUFSNAP curcpu 2 cpu 0
dtrace_ioctl(338): copyout the buffer snapshot
dtrace_ioctl(357): copyout buffer desc: size 0 drops 0 errors 0
dtrace(1)
dtrace
command is here:
plockstat(1)
as it's simplerplockstat(1)
cddl/contrib/opensolaris/cmd/plockstat/plockstat.c
if ((g_dtp = dtrace_open(DTRACE_VERSION, 0, &err)) == NULL)
fatal("failed to initialize dtrace: %s\n",
dtrace_errmsg(NULL, err));
...
if ((prog = dtrace_program_strcompile(g_dtp, g_prog,
DTRACE_PROBESPEC_NAME, 0, 0, NULL)) == NULL)
dfatal("failed to compile program");
if (dtrace_program_exec(g_dtp, prog, &info) == -1)
dfatal("failed to enable probes");
...
if (dtrace_go(g_dtp) != 0)
dfatal("dtrace_go()");
...
do {
...
switch (dtrace_work(g_dtp, stdout, NULL, chewrec, NULL)) {
case DTRACE_WORKSTATUS_DONE:
done = 1;
break;
...
} while (!done);
Key headers:
cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c:
Which D language constructs are supported?
static const dt_ident_t _dtrace_globals[] = {
...
{ "execname", DT_IDENT_SCALAR, 0, DIF_VAR_EXECNAME,
DT_ATTR_STABCMN, DT_VERS_1_0, &dt_idops_type, "string" },
{ "exit", DT_IDENT_ACTFUNC, 0, DT_ACT_EXIT, DT_ATTR_STABCMN, DT_VERS_1_0,
&dt_idops_func, "void(int)" },
...
{ "strlen", DT_IDENT_FUNC, 0, DIF_SUBR_STRLEN, DT_ATTR_STABCMN, DT_VERS_1_0,
&dt_idops_func, "size_t(const char *)" },
...
};
dtrace_prog_t
constructiondtrace_prog_t
-> DOF conversionFramework paths:
Provider paths:
From sys/cddl/contrib/opensolaris/uts/common/dtrace/dtrace.c
/*
* DTrace - Dynamic Tracing for Solaris
*
* This is the implementation of the Solaris Dynamic Tracing framework
* (DTrace). The user-visible interface to DTrace is described at length in
* the "Solaris Dynamic Tracing Guide". The interfaces between the libdtrace
* library, the in-kernel DTrace framework, and the DTrace providers are
* described in the block comments in the <sys/dtrace.h> header file. The
* internal architecture of DTrace is described in the block comments in the
* <sys/dtrace_impl.h> header file. The comments contained within the DTrace
* implementation very much assume mastery of all of these sources; if one has
* an unanswered question about the implementation, one should consult them
* first.
dtrace_probe()
/*
* If you're looking for the epicenter of DTrace, you just found it. This
* is the function called by the provider to fire a probe -- from which all
* subsequent probe-context DTrace activity emanates.
*/
void
dtrace_probe(dtrace_id_t id, uintptr_t arg0, uintptr_t arg1,
uintptr_t arg2, uintptr_t arg3, uintptr_t arg4)
dtrace_probe()
- probe context * dtrace_probe() may be called in virtually any context: kernel, user,
* interrupt, high-level interrupt, with arbitrary adaptive locks held, with
* dispatcher locks held, with interrupts disabled, etc. The only latitude
* that must be afforded to DTrace is the ability to make calls within
* itself (and to its in-kernel subroutines) and the ability to access
* arbitrary (but mapped) memory. On some platforms, this constrains
* context. For example, on UltraSPARC, dtrace_probe() cannot be called
* from any context in which TL is greater than zero. dtrace_probe() may
* also not be called from any routine which may be called by dtrace_probe()
* -- which includes functions in the DTrace framework and some in-kernel
* DTrace subroutines. All such functions "dtrace_"; providers that
* instrument the kernel arbitrarily should be sure to not instrument these
* routines.
static void
fbt_provide_module(void *arg, modctl_t *lf)
{
...
/*
* Employees of dtrace and their families are ineligible. Void
* where prohibited.
*/
if (strcmp(modname, "dtrace") == 0)
return;
prototype
for a skeleton provider, sys/cddl/dev/prototype.cdtmalloc
for a simple real provider, sys/cddl/dev/dtmalloc/dtmalloc.cstatic dtrace_pops_t dtmalloc_pops = {
dtmalloc_provide,
NULL,
dtmalloc_enable,
dtmalloc_disable,
...
};
if (dtrace_register("dtmalloc", &dtmalloc_attr, DTRACE_PRIV_USER,
NULL, &dtmalloc_pops, NULL, &dtmalloc_id) != 0)
return;
ctfconvert
: converts DWARF to CTF (.SUNW_ctf
section)ctfmerge
: merges multiple CTF sections--- clock.o ---
ctfconvert -L VERSION -g clock.o
...
--- buildkernel ---
ctfmerge -L VERSION -g -o kernel.full ...
# readelf -S /boot/kernel/kernel|grep -A 1 ctf
...
[42] .SUNW_ctf PROGBITS 0000000000000000 018aaa38
00000000000d86bc 0000000000000000 43 0 4
kgdb
<sys_getpid> push %rbp
<sys_getpid+1> mov %rsp,%rbp
<sys_getpid+4> mov 0x8(%rdi),%rax
<sys_getpid+8> movslq 0xbc(%rax),%rax
dtrace -n 'fbt:kernel:sys_getpid:entry'
int3
replaces first sys_getpid
instructionpush
instruction is emulated in the kernelpush
instruction is restored after dtrace(1) exits<sys_getpid> int3
<sys_getpid+1> mov %rsp,%rbp
<sys_getpid+4> mov 0x8(%rdi),%rax
<sys_getpid+8> movslq 0xbc(%rax),%rax
int $3
in sys/cddl/dev/fbt/x86/fbt_isa.c#define FBT_PATCHVAL 0xcc
brk 0x40d
in sys/cddl/dev/fbt/aarch64/fbt_isa.c#define FBT_PATCHVAL (AARCH64_BRK | AARCH64_BRK_IMM16_VAL)
sbreak
in sys/cddl/dev/fbt/riscv/fbt_isa.c#define FBT_PATCHVAL (RISCV_INSN_BREAK)
dtrace_probe()
do_el1h_sync
-> dtrace_invop()
-> fbt_invop()
-> dtrace_probe()
dtrace_probe()
is calledint $0x3
on x86-64 (like FBT). No ARMv8 or RISC-V support yet.#define FASTTRAP_INSTR 0xcc
int
fasttrap_tracepoint_install(proc_t *p, fasttrap_tracepoint_t *tp)
{
fasttrap_instr_t instr = FASTTRAP_INSTR;
if (uwrite(p, &instr, 1, tp->ftt_pc) != 0)
return (-1);
exec()
calls using proc
provider# dtrace -n 'proc:::exec-success { printf("%s", curpsinfo->pr_psargs); }'
exec-success
SDT probe is defined in sys/kern/kern_exec.c:SDT_PROVIDER_DECLARE(proc);
...
SDT_PROBE_DEFINE1(proc, , , exec__success, "char *");
...
/*
* In-kernel implementation of execve(). All arguments are assumed to be
* userspace pointers from the passed thread.
*/
static int
do_execve(td, args, mac_p)
...
SDT_PROBE1(proc, , , exec__success, args->fname);
nop
s that can be rewritten to eventually enter dtrace_probe()
dtrace_probe()
dtrace_probe()
installed in sys/cddl/dev/sdt/sdt.cstatic void
sdt_load()
{
...
sdt_probe_func = dtrace_probe;
SDT_PROBE
calls sdt_probe_func()
(aka dtrace_probe()
if enabled) in sys/sys/sdt.h#define SDT_PROBE(prov, mod, func, name, arg0, arg1, arg2, arg3, arg4) do { \
...
(*sdt_probe_func)(sdt_##prov##_##mod##_##func##_##name->id, \
(uintptr_t) arg0, (uintptr_t) arg1, (uintptr_t) arg2, \
(uintptr_t) arg3, (uintptr_t) arg4); \
...
D translators:
D scripts:
DTrace testsuite: