Frequently asked questions

The system statistics collection daemon

These are real frequently asked questions, not some questions we thought of while sitting by ourselves and having a glass of wine. As a consequence, the questions are sometimes very specific and the answers sometimes require some knowledge about advanced topics. If you’re looking for stuff like “What does collectd do?” or “How do I enable plugin foo?”, please go to the appropriate place, for example the documentation page.

Diagnostic output

It doesn’t work. Where can I find diagnostic output?

In order to get any output at all, you need to load a log plugin. The two main log plugins are the LogFile and SysLog plugins. We recommend that loading one of those plugins is the first thing you do in your config file, i.e. put the LoadPlugin line at the very top. If no log plugin is loaded, collectd will write to STDERR. After the daemon has forked to the background, you won’t be able to see this output anymore, though.

Ping plugin and raw sockets

I try to use the Ping plugin, but keep getting the message “`ping_host_add' failed.”. What’s the matter?

In order to generate ICMP packets one needs to open a so-called “RAW socket”. On most UNIX systems only the superuser (root) may open such sockets. In addition, some virtualization environments, such as VServer and Solaris Zones, have been reported to cause some trouble.

Multicast traffic

Who receives the multicast traffic?

That entirely depends on your network setup. By default collectd uses “site local” addresses, which should not be routed outside your AS. Whether that’s really the case is up to you.

Build with dependencies

How do I use --with-librrd?

If you installed libraries in a non-standard (or non-system) path you need to specify that path when running the configure script. Otherwise it will not find them and will build the binaries without linking against the library. The PATH to pass is the one that was given to the --prefix option when compiling the library. The script actually looks for the two subdirectories PATH/include and PATH/lib, so check for their existence if things don’t work. If, for example, you installed RRDTool in /opt/rrdtool-x.y.z you need to run configure like this:

./configure --with-librrd=/opt/rrdtool-x.y.z
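If the configure script still does not pick up the library, a quick check of the two subdirectories mentioned above can help. The following is only a minimal sketch: the helper name missing_subdirs is made up for this example and is not part of collectd or its build system, and the path is the placeholder used above.

```python
from pathlib import Path


def missing_subdirs(prefix):
    """Return which of PATH/include and PATH/lib do not exist as directories.

    Hypothetical helper for illustration; it only mirrors the check
    described in the text, not the configure script's actual logic.
    """
    return [name for name in ("include", "lib")
            if not (Path(prefix) / name).is_dir()]


# Placeholder path from the example above; substitute your real prefix.
print(missing_subdirs("/opt/rrdtool-x.y.z"))
```

An empty list means both subdirectories are present; anything else names what is missing under the given prefix.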

Semantic versioning

What do the version numbers mean?

The version numbers consist of three numbers: The major- and minor-number and the patchlevel.

Versions with different major-numbers are basically not compatible. This means that the definitions of RRD files or config-options have been changed or, in general, that the user has to do something in addition to installing the new version. This is not nice and is avoided when possible, but it is sometimes necessary to prevent old mistakes from becoming ancient mistakes. We try to provide migration scripts, though, to make a switch as easy as possible. See the v4 to v5 migration guide for details.
Versions with differing minor-numbers are backwards compatible, i.e. you can replace the lower version with the higher one and everything should still work. This means that features are added, but not removed or changed, and that the default behavior does not change.
Versions with different patchlevels are both forward- and backwards-compatible, because no new features have been introduced. The only difference between the two versions is one or more bugfixes, so you should generally install the higher version of the two.
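
The three rules can be summarized in a few lines of code. This is just an illustrative sketch of the rules as stated here, assuming plain “major.minor.patchlevel” version strings; the function names are invented for the example and collectd ships no such tool.

```python
def parse_version(text):
    """Split 'major.minor.patchlevel' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def upgrade_kind(installed, candidate):
    """Classify a switch from `installed` to `candidate` (hypothetical helper)."""
    old, new = parse_version(installed), parse_version(candidate)
    if old[0] != new[0]:
        return "major: not compatible, expect migration work"
    if old[1] != new[1]:
        if new[1] > old[1]:
            return "minor: backwards compatible upgrade"
        return "minor: downgrade, newer features will be missing"
    if old[2] != new[2]:
        return "patch: compatible in both directions, bugfixes only"
    return "identical"


print(upgrade_kind("4.4.2", "4.4.5"))
```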

Enabling plugins

I enabled the foo plugin using --enable-foo but now the build process fails. What’s wrong?

This is the expected behavior. The configure script determines which libraries are installed and what compiler and linker flags are required to build applications using those libraries. Based on those results the plugins with met dependencies are enabled; all other plugins are disabled. If a plugin is displayed as disabled, that is because its dependencies are not met.

The best way to compile with a specific plugin is to install the missing dependencies and re-run the configure script. You can force a plugin to be built using the --enable-foo argument to the configure script, but then you need to know exactly what you are doing. If you do this you’re out in the dark, cold woods and totally on your own!

Debian package is missing dependencies

I installed the Debian package of collectd. Now I get the error “lt_dlopen (foo.so) failed: file not found” – but the file exists!

The Debian and Ubuntu packages of collectd contain all plugins that are available for the platform you’re using. However, they do not declare a dependency on all required libraries for all plugins, because that would be a lot of packages. In all likelihood you’re missing one of the required libraries. [*] Take a look at the file /usr/share/doc/collectd-core/README.Debian.plugins which lists all the required packages for each plugin. You can also use

ldd /usr/lib/collectd/foo.so

to figure out which shared object is missing and go from there.

[*] Yes, the error message “file not found” is very confusing. It is an automatically stringified version of the error code returned by lt_dlopen(). Versions of collectd that were released after February 2011 contain a more detailed error message for this case.

The intuitive way of organizing the collectd package would be to put plugins with special dependencies in separate packages which have a dependency on the library that’s required for the plugin. Unfortunately, consensus in the Debian community was that this would create too many packages. All the dependencies are listed in a field called Recommends, which is a sort of soft dependency. Since recommended packages are installed in the default setting of APT, this way is deemed good enough for the average user.

Static libs

The build process fails with “relocation R_X86_64_32 against `a local symbol’ can not be used when making a shared object; recompile with -fPIC”. What’s wrong?

Many plugins have to be linked against libraries. A few of them (currently iptables, netlink and nut are known to be affected) link against libraries that are only available as “static libraries” in many distributions. Most distributions (e.g. Debian and SuSE GNU/Linux) do not compile static libraries with the “-fPIC” option. Thus they cannot be linked into shared objects compiled with “-fPIC”. Some architectures (among them i386) do not seem to care about that and handle it in some (probably magic) way. However, other architectures (mostly 64-bit ones like amd64 or hppa) cannot handle that, and thus the build aborts with the error message mentioned above.
To fix this issue, you need a version of the static library compiled with “-fPIC” (or a shared library). Ask your distributor to provide a suitable version of the library or compile it yourself.

Solaris 32bit support

Solaris support is broken! The build aborts! Help!

Versions 4.4.5 and 4.5.2 include fixes in the build system, so the problems described below should be handled much more gracefully now.
There are two known issues with Solaris, but both can be fixed relatively easily:
If you build a 32-bit binary, the configure script will (try to) enable LFS. This will result in an error which looks something like this:

config.h:832:1: error: "_FILE_OFFSET_BITS" redefined

Also, the swap-plugin has some problems of its own with this:

swap.c:197: warning: implicit declaration of function 'swapctl'  
swap.c:197: error: 'SC_AINFO' undeclared (first use in this function)

The problem is that Solaris’ swap interface is not available to 32bit applications. The solution is to build a 64bit binary! If you build a 64bit binary, LFS is not needed and the swap plugin works as intended. To do this, pass the -m64 flag to the compiler (assuming you’re using the Sun C compiler).
Another problem is that by default Sun defines a version of getgrnam_r that isn’t POSIX-compatible. To enable POSIX-compatibility pass the _POSIX_PTHREAD_SEMANTICS define to the compiler. This define is set automatically in versions 4.4.5, 4.5.2 and later.
Putting it all together, you need to pass the following flags to the configure-script:

# Sun CC
./configure CFLAGS="-m64 -mt -D_POSIX_PTHREAD_SEMANTICS"

Please note that we only test the Sun C compiler ourselves, but GCC may work, too. When using GCC you need to substitute the -mt flag with the -pthreads flag. So if you use GCC the above invocation of ./configure becomes:

# GCC
./configure CFLAGS="-m64 -pthreads -D_POSIX_PTHREAD_SEMANTICS"

Thanks to Christophe Kalt for sharing his insights :)

Split metrics

Why do many plugins, for example the CPU plugin, split related metrics across so many files? Can I change that?

The short answer is: We do this in order to be able to provide strict backwards compatibility. Writing all the details to a single file is not possible; for the CPU plugin, set the ReportByState and ReportByCpu options to false for an aggregated output.

The long answer and explanation of the short answer is: collectd runs on a variety of operating systems. Each operating system has its own method for accounting CPU states, memory consumption, swap usage, and so on. If all these data sources were in one data set, every newly supported operating system or any addition to an already supported operating system would mean that we need to modify the data set. This cannot be done without breaking backwards compatibility.

To give you a few examples: Sometime in mid-2.6 the Linux kernel added some Xen-patches which provided a new CPU state: “steal time”. When adding support for BSD systems we had to add “wired” memory. NFSv4 added some new procedures that NFSv3 didn’t have, etc.

Changing the layout of the data is not just a matter of changing the types.db file. That file describes the layout of the data submitted by plugins. The plugins don’t need it - they know what data they submit. It’s needed by the daemon and the write plugins to know how to store the data. If you mess with the file without knowing what you are doing, you will most likely end up with the data not being collected at all anymore.

Going forward, we intend to push the “one data source per file” rule even more and, eventually, make it the only supported mode of operation. If you are writing extensions for collectd, it would be best to bear this in mind.

collection.cgi is incomplete

Why doesn’t collection.cgi draw foo graphs correctly?

That script is meant as a starting point for your own development, not as a ready-to-use web frontend for RRD files written by collectd.

It is just an example, because it’s not really usable as it is. And it’s not really usable, because we are UNIX developers and don’t enjoy doing web stuff much. Working on the daemon is just so much more fun. ;) So in the best of free / open source traditions: Patches welcome!

There are alternatives, though. We’ve heard from various people using Cacti to render the graphs. Sergiusz Pawlowicz of the BBC has written CollectGraph, a macro for the MoinMoin wiki. And of course there’s drraw.

CPU jiffies

Why don’t the CPU states sum up to 100%?

By default, the CPU plugin does not collect the CPU usage in percent, but in “jiffies”. If you prefer a percentage, set the ValuesPercentage option to true.

A jiffy is the time-unit which the scheduler in the operating system uses to manage run times of applications. Under Linux, the default configuration is to have 100 jiffies per second, which leads many users to believe they’re getting a percentage. You can, however, configure your kernel at compile time to use 250 or 1000 jiffies per second, which usually results in a more responsive system but decreased I/O throughput. Especially on busy systems, virtual systems and systems with a “tickless kernel” there may not always be the exact number of intended jiffies in one second, resulting in the variance you’ve noticed in the graphs.

That you see this issue in collectd but not in other similar tools is, in many cases, due to the fact that collectd collects data so frequently. Over the timespan of, say, five minutes these variations even out, but the alleged percentages are, in fact, jiffies.
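
To see why raw jiffies need not add up to 100, consider turning two samples of the per-state counters into percentages by hand. The sketch below divides each state’s increase by the total increase that was actually observed instead of assuming 100 jiffies per second. The counter values are invented, and this is not necessarily how the ValuesPercentage option is implemented inside the CPU plugin.

```python
def state_percentages(previous, current):
    """Turn two samples of cumulative jiffy counters into percentages.

    Both arguments map a CPU state name to its counter and are assumed
    to contain the same states.
    """
    deltas = {state: current[state] - previous[state] for state in current}
    total = sum(deltas.values())
    if total <= 0:
        # No jiffies were accounted between the two samples.
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * delta / total for state, delta in deltas.items()}


# Invented sample: only 97 jiffies were accounted in this interval.
before = {"user": 1000, "system": 500, "idle": 8500}
after = {"user": 1030, "system": 510, "idle": 8557}
print(state_percentages(before, after))
```

Because the divisor is the measured total, the resulting values sum to 100 even when the interval contained fewer or more jiffies than intended.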

Network encryption

Is network traffic encrypted or signed?

Yes, starting with version 4.7.0 you can either sign the traffic using a Hashed Message Authentication Code (HMAC) or encrypt the traffic. Please refer to the Network plugin wiki page for details.

Value too old

I get frequent errors that a “value is too old”. What’s this about?

The complete error message usually looks like this:

[2009-05-06 14:03:05] uc_update: Value too old: name = device.domain.tld/snmp/frequency-output; value time = 1241611385; last cache update = 1241611385;

When adding a new value to the internal cache, the timestamp on that value is checked against the timestamp on the last value with the same name that was added to the cache. The error message informs you that the value already in the cache was newer than or as new as the value that should have been added. In the example above, a value for device.domain.tld/snmp/frequency-output should be added, but its timestamp (1241611385) is the same as the timestamp already present in the cache, i.e. it is a duplicate.
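
The check can be pictured roughly as follows. This is a toy model of the behavior described above, not the daemon’s actual uc_update code; the class and method names are made up for the illustration.

```python
class ValueCache:
    """Toy cache that remembers the newest timestamp seen per identifier."""

    def __init__(self):
        self._last_update = {}

    def update(self, name, value_time):
        """Store the timestamp, or return an error text if it is not newer."""
        last = self._last_update.get(name)
        if last is not None and value_time <= last:
            return ("Value too old: name = %s; value time = %d; "
                    "last cache update = %d;" % (name, value_time, last))
        self._last_update[name] = value_time
        return None


cache = ValueCache()
name = "device.domain.tld/snmp/frequency-output"
cache.update(name, 1241611385)         # first value is accepted
print(cache.update(name, 1241611385))  # same timestamp again is rejected
```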

The most common source of this is that somehow two values with the same identifier (name) are reported. One frequent reason for this is that two hosts report data using the same host name and send it to a central server. If the “last cache update time” increases with each message, this is very likely the case. You can use Wireshark (1.4 or later) to analyze and filter the collectd network traffic and find out from which IP addresses the duplicate values originate. The second most common reason is a misconfiguration of generic plugins, such as the SNMP plugin.
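
Once you have noted which sender address each reported host name came from, for example from a packet capture, finding the clashes is simple bookkeeping. The sketch below assumes you have already extracted (host name, sender IP) pairs by whatever means; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def hosts_with_multiple_senders(observations):
    """Return host names that were reported from more than one address.

    `observations` is an iterable of (host_name, sender_ip) pairs.
    """
    senders = defaultdict(set)
    for host_name, sender_ip in observations:
        senders[host_name].add(sender_ip)
    return {host: sorted(ips) for host, ips in senders.items() if len(ips) > 1}


# Invented sample data using documentation addresses.
seen = [
    ("web01", "192.0.2.10"),
    ("web01", "192.0.2.11"),
    ("db01", "192.0.2.20"),
]
print(hosts_with_multiple_senders(seen))
```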

A similar variant of the above problem is that the daemon is running twice on the same host. You can use the ps command to check if this is the case.

These errors may also be caused by a plugin being loaded twice. You can check if each plugin is loaded only once by checking the LoadPlugin lines:

grep -i LoadPlugin /etc/collectd/collectd.conf | grep -Ev '^[[:space:]]*#' | sort | uniq -c

Another common cause is that time on the client jumps backwards. This may happen due to a weekly ntpdate forcefully setting the time, for example. Virtual hosts often have problems providing a steady wallclock time, but usually they have jumps forward (causing gaps). It might be worth investigating nonetheless.