= Dstat: pluggable real-time monitoring
Dag Wieers <dag@wieers.com>
$Id$

'This Dstat paper was originally written for LinuxConf Europe that was
held together with the Linux Kernel summit at the University in Cambridge,
UK in August 2007.'

== Introduction
Many tools exist to monitor hardware resources and software behaviour, but few
tools exist that allow you to easily monitor any conceivable counter.

Dstat was designed with the idea that it should be simple to plug in a piece
of code that extracts one or more counters, and make it visible in a way that
visually pleases the eye and helps you extract information in real-time.

By being able to select those counters that you want (and likely those
counters that matter to you in the job you're doing) you make it easier to
correlate raw numbers and see a pattern that may otherwise not be visible.

== A case for Dstat
A few years ago I was involved in a project that was testing a storage cluster
with a SAN back-end using GPFS and Samba for a broadcasting company. The
performance tests that were scheduled together with the customer took a few
weeks to measure the different behaviour under different stresses.

During these tests there was a need to see how each of the components behaved
and to find problematic behaviour during testing. Also, because it involved 5
GPFS nodes, we needed to make sure that the load was spread evenly during the
test. If everything went well repeatedly, the results were validated and the
next batch of tests could be prepared and run.

We started off using different tools at first, but the more counters we were
trying to capture the harder it was to post-process the information we had
collected. What's more, we often saw only after performing the tests that the
data was not representative because the numbers didn't add up. Sometimes it
was caused by the massive setup of clients that were autonomously stressing the
cluster. On other occasions we noticed that the network was the culprit. All in
all, we lost time because we could only validate the results by relating
numbers after the tests were complete and not during the tests.

Complicating the matter was the fact that 5 different nodes were involved
and using the normal command line tools like vmstat, iostat or ifstat (which
only showed us a small part of what was happening) was problematic as each
needed a different terminal. Besides, not all information was interesting.

Eventually Dstat was born, to make a dull task more enjoyable.

After the project was finished I was able to correlate system resources with
network throughput, TCP information, Samba sessions, GPFS throughput,
accumulated block device throughput, HBA throughput, all within a single
interval on one screen for the complete cluster.

== Dstat characteristics
There are many ideas incorporated into Dstat by design, and this section
serves to list all of them. Not all of them may appeal to the task you're
doing, but the combination may make it an appealing proposition nevertheless.

=== History of counters
An important characteristic of line-based tools like vmstat, iostat or
ifstat is the fact that you can compare previously collected data with
new data. This gives you a good feeling for how something is
evolving.

Compare this to tools like top or nmon, where data is constantly refreshed
and you lose historical information (although in return they can show you
a lot more information at the same time).

=== Adding unit indication
It was very important that when numbers were compared, they were in the same
unit, and not eg. a different power exponent. The human mind sometimes works
in mysterious ways and more so when working with numbers for hours and hours.
Adding the unit is something very convenient and may reduce the human error
factor.

Additionally, indicating the unit also makes sure that the columns have a
fixed width. Often when using vmstat or other tools, the columns tend to shift
depending on the width of the counter. This makes it very inconvenient to find
counters in the shifted output.
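
The fixed-width idea can be sketched in a few lines of Python. This is only an
illustration and not Dstat's own formatting code: the helper name +fmt_units+,
the width of 5 characters and the unit letters are assumptions chosen to
resemble the sample output further down.

```python
def fmt_units(value, width=5):
    """Scale value by powers of 1024 and return it with a unit suffix.

    The result is always right-aligned to the same width, so a column
    built from these strings cannot shift. (Illustrative helper only.)
    """
    units = "BkMGT"
    step = 0
    scaled = float(value)
    # Scale down until the integer part fits in width-1 characters.
    while scaled >= 10 ** (width - 1) and step < len(units) - 1:
        scaled /= 1024.0
        step += 1
    text = "%d%s" % (int(scaled), units[step])
    return text.rjust(width)


for sample in (359, 1384 * 1024, 123456789):
    print("[%s]" % fmt_units(sample))
```

Truncating rather than rounding keeps the number within four digits, which is
what guarantees the constant width in this sketch.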

=== Colour highlighting units
After I added colours to help improve indicating units, I noticed that the
colours also helped to show patterns. This of course is very limited,
nevertheless it instantly shows when numbers are flat or changes are taking
place.

IMPORTANT: The colours are arbitrarily chosen. Do not make the mistake of
assuming that green means good and red means bad. There is no real meaning to
the colour itself; however, a change of colour does mean that a value has gone
over some pre-defined limit.

=== Intermediate updates
During tests, when you choose to see average values over a given time, it can
be useful to see how the averages evolve. Dstat, by default, displays
intermediate updates. This means that if you select to see 10 second averages,
after each second you see the accumulated average over the timespan. *This
means that after 4 seconds with intermediate updates, you see an average
taken over the 4 second timeframe.*

NOTE: This means that the closer you get to the given timeframe (eg. 10 seconds)
the more likely that it nears its final average over that period.
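
As a rough sketch of this behaviour (not taken from the Dstat source), the
hypothetical generator below receives one sample per second and yields the
average accumulated so far, starting over once the chosen interval is complete.

```python
def intermediate_averages(samples, interval=10):
    """Yield (seconds_elapsed, running_average) for each one-second sample.

    The accumulator is reset after a full interval has been shown, so the
    last value of every interval is the final average for that interval.
    """
    total = 0.0
    elapsed = 0
    for sample in samples:
        total += sample
        elapsed += 1
        yield elapsed, total / elapsed
        if elapsed == interval:
            total, elapsed = 0.0, 0


# After 4 seconds the displayed value is the average over those 4 seconds.
for seconds, average in intermediate_averages([10, 20, 30, 40], interval=10):
    print(seconds, average)
```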

=== Adding custom counters
Dstat was specifically designed to enable anyone to add their own counters in a
matter of minutes. The plugin-based system takes care of displaying, colouring
and adding units to the counters. As a plugin-writer, you only have to focus
on extracting the counters from the kernel (procfs or sysfs), logfiles or
daemons.
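
To give an idea of how little is involved in the extraction step, here is a
stand-alone sketch that reads the load averages from procfs. It deliberately
does not use the Dstat plugin class; the function names are made up for this
example, and only the layout of +/proc/loadavg+ (three load averages first) is
relied upon.

```python
def parse_loadavg(text):
    """Extract the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    return {
        "1m": float(fields[0]),
        "5m": float(fields[1]),
        "15m": float(fields[2]),
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the file; the path only exists on Linux systems."""
    with open(path) as handle:
        return parse_loadavg(handle.read())


# Parsing is kept separate from reading so it can be tried on any system.
print(parse_loadavg("0.20 0.18 0.12 1/80 11206\n"))
```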

=== Selecting plugins and counters
Being able to add custom counters is important, but selecting those counters
that you really need is even more important if you want to correlate counters
and see patterns. Less is more.

NOTE: In fact, Dstat currently does not allow you to select just counters, it
only allows you to select plugins. However, since you can modify or fork a
plugin, you still have the ability to select just those counters you prefer.

=== Exporting to CSV
Having information on screen is one thing, but you most likely need some hard
evidence later to make your case. (Why else do all the work?)

Dstat allows you to write out all counters in the greatest detail possible to
CSV. By default it also adds the command line used for generating the output, as
well as a date and time stamp. Since Dstat is in the first place meant for
human-readable real-time statistics, it will by default also display the
counters on screen (unless you redirect that output to _/dev/null_).

TIP: Dstat appends to the output file so that you can add the results of
different tests to a single file. However, make sure that you tag each test
properly (eg. by using distinct filenames for each different test).
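
The append-and-tag idea can be illustrated with Python's +csv+ module. Note
that the marker row used below (tag, command line, time stamp) is a layout
invented for this sketch; it is not the format Dstat itself writes, and
+append_run+ is a hypothetical helper.

```python
import csv
import io
import time


def append_run(handle, tag, cmdline, header, rows):
    """Append one test run to an already open CSV file.

    The marker row is an invented layout for this sketch; it is NOT the
    format that Dstat itself writes.
    """
    writer = csv.writer(handle)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    writer.writerow(["# run", tag, cmdline, stamp])
    writer.writerow(header)
    writer.writerows(rows)


# io.StringIO stands in for a file opened with mode "a" (append).
log = io.StringIO()
append_run(log, "test-a", "dstat -cd", ["usr", "sys"], [[10, 2], [4, 3]])
append_run(log, "test-b", "dstat -cd", ["usr", "sys"], [[5, 2]])
print(log.getvalue())
```

Because every run starts with its own marker row, the results of several tests
can share one file and still be told apart afterwards.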

=== Time-plugin included
It may seem a small thing, but having exact time (and date) information for
your counters allows for a completely different usage as well. By adding
simple date and time information, Dstat can be used as a background process in
a screen to monitor the behaviour of your system during the night.

This proves to be very valuable, for example to find offending processes during
nightly tasks or to relate their behaviour to certain events that you cannot
monitor during working hours.

It is also important when you have multiple Dstats running (eg. for nodes in a
cluster) to correlate counters between the outputs.

=== Terminal capabilities
Dstat also takes into account the width and height of your terminal window and
modifies output to fit into your terminal. This, of course, has no effect on
what ends up in the CSV output.
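
One way to picture this is a function that decides how many columns fit on a
line. The sketch below is an assumption about how such a check could look, not
the logic Dstat uses; +columns_that_fit+ is a hypothetical name and the widths
in the example are simply those of the column groups in the sample output later
in this paper.

```python
import shutil


def columns_that_fit(widths, terminal_width=None, separator=1):
    """Return how many leading columns fit on one terminal line."""
    if terminal_width is None:
        # Falls back to 80 columns when no terminal is attached.
        terminal_width = shutil.get_terminal_size().columns
    used = 0
    count = 0
    for width in widths:
        # Every column after the first also needs a separator character.
        needed = width if count == 0 else width + separator
        if used + needed > terminal_width:
            break
        used += needed
        count += 1
    return count


# time, cpu, disk (two devices), net, paging and system column groups.
print(columns_that_fit([14, 23, 23, 11, 11, 11], terminal_width=80))
```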

Another (debatably) useful feature is that Dstat will modify the terminal
title to indicate on what system it is running and what options were used.
Especially when monitoring nodes in a cluster, this can be useful, but even in
Gnome it is handy for finding your Dstat window.

WARNING: Some people, however, are annoyed by the fact that their distribution
does not reset the terminal title and Dstat therefore messes it up. There is no
way for Dstat to fix this.

== Plugins and counters
When we talk about plugins, we make a distinction between those plugins that
are included within the Dstat tool itself, and those that ship with it
externally. In essence there is no real difference, as the internal plugins
could easily have been created as an external plugin. The basic difference is
that the internal plugins have no dependencies except on procfs.

Having the basic plugins as part of Dstat makes sure that Dstat can be moved
as a self-contained file to other systems.

=== Internal plugins
The plugins that have been selected to be part of the Dstat tool itself, and
therefore have no dependencies other than procfs, are:

- aio: asynchronous I/O counters
- cpu, cpu24: CPU counters (+-c+ and +-C+)
- disk, disk24, disk24old: disk counters (+-d+ and +-D+)
- epoch: seconds since Epoch (+-T+)
- fs: file system counters
- int, int24: interrupts per IRQ (+-i+ and +-I+)
- io: I/O requests completed (+-r+)
- ipc: IPC counters
- load: load counters (+-l+)
- lock: locking counters
- mem: memory usage (+-m+)
- net: network usage (+-n+ and +-N+)
- page, page24: paging counters (+-g+)
- proc: process counters (+-p+)
- raw: raw socket counters
- swap, swapold: swap usage (+-s+ and +-S+)
- socket: socket counters
- sys: system (kernel) counters (+-y+)
- tcp: TCP socket counters
- time: date and time (+-t+)
- udp: UDP socket counters
- unix: unix socket counters
- vm: virtual memory counters

For backward compatibility with older kernels there is a cascading system that
selects the most appropriate internal plugin for your kernel (eg. the
+dstat_disk+ plugin falls back to +dstat_disk24+ and +dstat_disk24old+). At this
moment there is no such system for external plugins.
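
The cascade boils down to "take the first candidate that works on this kernel".
A minimal sketch, assuming a caller-supplied check: +select_plugin+ and
+is_supported+ are hypothetical names, and how Dstat really tests each
candidate is not shown here.

```python
def select_plugin(candidates, is_supported):
    """Return the first candidate for which is_supported(name) is true.

    candidates is ordered from newest to oldest; None is returned when no
    candidate is usable.
    """
    for name in candidates:
        if is_supported(name):
            return name
    return None


cascade = ["dstat_disk", "dstat_disk24", "dstat_disk24old"]

# Pretend we are on a kernel where only the 2.4-style plugin works.
print(select_plugin(cascade, lambda name: name == "dstat_disk24"))
```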

=== External plugins
This basic functionality is easily extended by writing your own plugins
(subclasses of the Python Dstat class) which are then inserted at runtime
into Dstat. A set of 'external' modules exists for:

- battery: battery usage
- battery-remain: remaining battery time
- cpufreq: CPU frequency
- dbus: DBUS connections
- disk-tps: disk transactions counters
- disk-util: disk utilization percentage
- dstat: dstat cputime consumption and latency
- dstat-cpu: dstat advanced cpu usage
- dstat-ctxt: dstat context switches
- dstat-mem: dstat advanced memory usage
- fan: Fan speed
- freespace: free space on filesystems
- gpfs: GPFS IO counters
- gpfs-ops: GPFS operations counters
- helloworld: Hello world dispenser
- innodb-buffer: innodb buffer counters
- innodb-io: innodb I/O counters
- innodb-ops: innodb operations counters
- lustre: lustre throughput counters
- memcache-hits: Memcache hit counters
- mysql5-cmds: MySQL communication counters
- mysql5-conn: MySQL connection counters
- mysql5-io: MySQL I/O counters
- mysql5-keys: MySQL keys counters
- mysql-io: MySQL I/O counters
- mysql-ops: MySQL operations counters
- net-packets: number of packets received and transmitted
- nfs3: NFS3 client counters
- nfs3-ops: NFS3 client operations counters
- nfsd3: NFS3 server counters
- nfsd3-ops: NFS3 server operations counters
- ntp: NTP time counters
- postfix: postfix queue counters
- power: Power usage counters
- proc-count: total number of processes
- qmail: qmail queue sizes
- rpc: RPC client counters
- rpcd: RPC server counters
- sendmail: sendmail queue counters
- snooze: Dstat time delay counters
- squid: squid usage statistics
- thermal: Thermal counters
- top-bio: most expensive block I/O process
- top-bio-adv: most expensive block I/O process (advanced)
- top-cpu: most expensive cpu process
- top-cpu-adv: most expensive CPU process (advanced)
- top-cputime: process using the most CPU time
- top-cputime-avg: process having the highest average CPU time
- top-int: most frequent interrupt
- top-io: most expensive I/O process
- top-io-adv: most expensive I/O process (advanced)
- top-latency: process with the highest total latency
- top-latency-avg: process with the highest average latency
- top-mem: most expensive memory process
- top-oom: process first shot by OOM killer
- utmp: utmp counters
- vm-memctl: VMware guest memory counters
- vmk-hba: VMware kernel HBA counters
- vmk-int: VMware kernel interrupt counters
- vmk-nic: VMware kernel NIC counters
- vz-cpu: OpenVZ CPU counters
- vz-io: I/O usage per OpenVZ guest
- vz-ubc: OpenVZ user beancounters
- wifi: WIFI quality information

=== Most-wanted plugins
Hoping someone interested reads this document, I have listed a few plugins that
would be ``very nice'' to have but are currently lacking:

- slab: needs a VM expert to make sense out of the vast amount of data
- xorg: needs information on how to get X resources, would be nice
to see evolution of X resources over time
- samba: lacking information to get counters from Samba without
forking smbstatus every second
- snmp: could be useful to relate counters from different systems
in a single Dstat
- topx: display the most expensive X application(s)
- systemtap: connecting Dstat to systemtap counters

Creative souls with other ideas are welcome as well !

== Using Dstat
Central to the Dstat command line interface is the selection of plugins. The
selection and order of options influence the Dstat output directly.

=== Enabling plugins
The internal plugins have short and/or long options within Dstat, eg. +-c+ or
+--cpu+ will enable the cpu counters.

The external plugins are enabled by a long option including their name,
eg. +--top-cpu+.

The following examples will enable the time, cpu and disk plugins, and are
equivalent.

----
dstat -tcd
dstat --time --cpu --disk
----

=== Total or individual counters
Some of the plugins can show either total values or individual values and
therefore have an extra option to influence this decision.

----
dstat -d -D sda,sdb
dstat -n -N eth0,eth1
dstat -c -C total,0,1
----

You can show both the individual values and total values as follows:

----
[dag@horsea ~]$ dstat -d -D total,hda,hdc
-dsk/total----dsk/hda-----dsk/hdc--
 read  writ: read  writ: read  writ
1384k 1502k: 114k 1332k:  81k  359B
   0    44k:   0    44k:   0     0
   0     0 :   0     0 :   0     0
----

The special +-f+ or +--full+ option allows you to select individual counters by
default, and can be overruled by +-C+, +-D+, +-I+, +-N+ or +-S+.

=== Influencing output
Dstat has a few more options to influence its output. With the +--nocolor+
option one can disable colours. The +--noheaders+ option disables repeating headers.
The +--noupdate+ option disables intermediate updates. The +--output+ option
is used for writing out to a CSV file.

=== Plugin search path
Dstat looks in the following places for plugins. This allows a user without
root privileges to use some extra plugins.

- ~/.dstat/
- <binarypath>/plugins/
- /usr/share/dstat/
- /usr/local/share/dstat/

The option +--list+ shows the available plugins and their location in the
order that the plugin search path is used.

NOTE: Plugins are named +dstat_<name>.py+.
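
A directory search along these lines could look as follows. This is a sketch
under assumptions, not Dstat's implementation: +list_plugins+ is a hypothetical
helper, and letting the first directory win when a name occurs twice is my
reading of "in the order that the plugin search path is used".

```python
import glob
import os


def list_plugins(search_path):
    """Map plugin name to file name, honouring the order of search_path."""
    found = {}
    for directory in search_path:
        pattern = os.path.join(os.path.expanduser(directory), "dstat_*.py")
        for filename in sorted(glob.glob(pattern)):
            # Strip the "dstat_" prefix and the ".py" suffix.
            name = os.path.basename(filename)[len("dstat_"):-len(".py")]
            # First match wins, so earlier directories take precedence.
            found.setdefault(name, filename)
    return found


# The <binarypath> entry is left out because it depends on the installation.
for name, filename in sorted(list_plugins(
        ["~/.dstat/", "/usr/share/dstat/", "/usr/local/share/dstat/"]).items()):
    print(name, filename)
```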

== Use-cases
Below are some use-cases to demonstrate the usage of Dstat.

WARNING: The following examples do not look as nice as they do on screen
because this document is not printed in colour (and I did not prepare it in
colour :-)).

=== Simple system check
Let's say you quickly want to see if the system is doing alright. In the past
this would probably have been a +vmstat 1+; now you would do:

----
dstat -taf
----

.Sample output
----
[dag@rhun dag]$ dstat -taf
-----time----- -------cpu0-usage------ --dsk/sda-----dsk/sr0-- --net/eth1- ---paging-- ---system--
  date/time   |usr sys idl wai hiq siq| read  writ: read  writ| recv  send|  in   out | int   csw
02-08 02:42:48| 10   2  85   2   0   0|  22k   23k: 1.8B    0 |   0     0 |2588B 2952B| 558   580
02-08 02:42:49|  4   3  93   0   0   0|   0     0 :   0     0 |   0     0 |   0     0 |1116   962
02-08 02:42:50|  5   2  90   0   2   1|   0    28k:   0     0 |   0     0 |   0     0 |1380  1136
02-08 02:42:51| 11   6  82   0   1   0|   0     0 :   0     0 |   0     0 |   0     0 |1277  1340
02-08 02:42:52|  3   3  93   0   1   0|   0    84k:   0     0 |   0     0 |   0     0 |1311  1034
----

NOTE: The +-t+ here is completely optional and generally wastes space. But
often you are not monitoring for 10 seconds but rather measuring in minutes or
hours. Having a general idea of the timescale over which counters have been
averaged is nevertheless interesting.

=== What is this system doing now ?
I often run both the +dstat_top_cpu+ and +dstat_top_mem+ plugins on a system,
just to see what a system is doing. A quick look at which application is using
the most CPU over a few minutes, and at how much memory the top application
uses, gives away a lot about a system.

.Sample output
----
[dag@horsea dag]$ dstat -c --top-cpu -dng --top-mem
----total-cpu-usage---- -most-expensive- -dsk/total- -net/total- ---paging-- -most-expensive-
usr sys idl wai hiq siq|  cpu process   | read  writ| recv  send|  in   out | memory process
  9   2  80   9   0   0|kswapd         0| 123k  164k|   0     0 |9196B   18k|rsync        74M
  2   3  95   0   0   0|sendmail       1|   0   168k|2584B   39k|   0     0 |rsync        74M
 18   3  79   0   0   0|httpd         17|   0    88k|5759B  118k|   0     0 |rsync        74M
  3   2  94   1   0   0|sendmail       1|4096B    0 |2291B 4190B|   0     0 |rsync        74M
  2   3  95   0   0   0|httpd          1|   0     0 |2871B 3201B|   0     0 |rsync        74M
 10   7  83   0   0   0|httpd         13|   0     0 |2216B   10k|   0     0 |rsync        74M
  2   2  96   0   0   0|                |   0    52k| 724B 2674B|   0     0 |rsync        74M
----

=== What process is using all my CPU, memory or I/O at 4:20 AM ?
Imagine the monitoring team notices strange peaks, a system engineer got a
worthless message, the system was swapping extensively, a process got killed.

Something indicates the system is doing something unexpected but what is
causing it and why ? As of now you can do:

----
screen dstat -tcy --top-cpu 120
screen dstat -tmgs --top-mem 120
screen dstat -tdi --top-io 120
----

to see what process is using the most CPU, the most memory and the most I/O
resources.

And hopefully one day we can do:

----
dstat -tn --top-net 120
dstat -tn --top-x 120
----

Leave it running during the night and in the morning you can see the light.

=== How many ticks per second on my kernel ?
In some cases it can be useful to see how many ticks (timer interrupts) your
kernel is producing. With older kernels this is a fixed number (usually 100,
250 or 1000) but on newer kernels the number can be dynamic.

Also on VMware virtual machines, the number of ticks can cause clock issues,
so in that case if you want to see what is happening, you can simply do:

----
dstat -ti -I0 --snooze --debug
----

Dstat nowadays can also detect lost ticks (when the number of ticks does not
match the time progress). This is useful to correlate VM issues with other
problems.
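
The arithmetic behind both numbers is simple. The sketch below is not Dstat's
detection code; the two helper names are made up, and the lost-tick estimate
only makes sense for a kernel with a fixed tick rate (the +hz+ argument).

```python
def ticks_per_second(count_before, count_after, seconds):
    """Tick rate from two readings of the timer interrupt counter."""
    return (count_after - count_before) / float(seconds)


def lost_ticks(count_before, count_after, seconds, hz):
    """Expected minus observed ticks over the elapsed wall-clock time.

    Only meaningful when the kernel ticks at a fixed rate of hz.
    """
    expected = hz * seconds
    observed = count_after - count_before
    return max(0, expected - observed)


# A 1000 Hz kernel that delivered only 990 timer interrupts in one second.
print(ticks_per_second(5000, 5990, 1))
print(lost_ticks(5000, 5990, 1, hz=1000))
```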

////
=== Monitoring memory consumption of a process over time
Now, I have twice used Dstat to verify memory usage. And I have concluded that
2 programs have severe memory leaks. One, unsurprisingly, is Firefox, the
other sadly is wnck-applet (yes, unfortunately).

Now Dstat is currently not really useful for specifying your own process to
monitor (unless you dig into the module, which is easier than one might
expect). But I am already anticipating Pstat, which is a Dstat but for
process-related counters.