You can find the answers to these questions on the FAQ page.
  • It doesn't work. Where can I find diagnostic output?
    Version 3.* writes warnings and error messages using the syslog(3) facility. Depending on your system, the syslog daemon writes these messages to files and/or sends them to another host. On most GNU/Linux distributions, the place to look is either /var/log/syslog or /var/log/messages.
    Version 4.0 and later come with the logfile and syslog plugins, which can be used to write status messages to a file or send them to the syslog daemon.
  • Some lines of the config file seem to be ignored?
    Yes, that's a known bug. You probably have whitespace at the end of the lines being ignored.
    This is a bug in the library used by collectd 3.* to parse the configfile. Versions 4.0 and later use a different library and don't have this problem.
  • Can I adjust the interval in which data is collected?
    Yes, since version 3.9.0 this can be set at compile time. Keep in mind, though, that this will change the layout of the generated RRD-files. Also, clients and servers should use the same setting to avoid unexpected results.
    Since version 4.0, this setting can be adjusted in the configfile using the Interval option.
  • I'm trying to use the ping-plugin, but keep getting the message "`ping_host_add' failed.". What's the matter?
    In order to generate ICMP packets, one needs to open a so-called "RAW socket". On most UNIX systems, only the superuser (root) may open such sockets.
    In addition, some virtualization environments, such as VServer and Solaris Zones, have been reported to cause trouble.
  • Who receives the multicast traffic?
    I don't know. That entirely depends on your network setup. By default, collectd uses "site local" multicast addresses, which should not be routed outside your autonomous system (AS). Whether that is really the case is up to you.
  • What does "Invalid value for config option `Mode': `Local'" mean?
    It means that the mode "Local" is not available. Most likely the "librrd" library wasn't found. If you want to write to RRD-files, install "librrd" or, if you already did that, use the --with-rrdtool option of the ./configure script to point it in the right direction.
  • How do I use --with-rrdtool?
    If you installed libraries in a non-standard (or non-system) path, you need to specify that path when running the configure script. Otherwise it will not find them and will build the binaries without linking against the library.
    Set PATH to the same value you passed to the --prefix option when compiling the library. The script actually looks for the two subdirectories PATH/include and PATH/lib, so check that they exist if things don't work. If, for example, you installed RRDTool in /opt/rrdtool-x.y.z, you need to run configure like this:
    $ ./configure --with-rrdtool=/opt/rrdtool-x.y.z
  • The apache-plugin reports the following error: apache: curl_easy_perform failed: Failed writing body. What's wrong?
    The response received was too big and didn't fit into the buffer. Check the URL option in the configfile. In particular, check that the URL ends in "?auto": collectd requires the machine-readable output generated by the Apache module mod_status and will not work with anything else.
  • What do the version numbers mean?
    A version number consists of three parts: the major number, the minor number and the patchlevel.
    • Versions with different major numbers are basically not compatible. This means that the definitions of RRD-files or config options have changed or, in general, that the user has to do something in addition to installing the new version. This is not nice and is avoided when possible, but it is sometimes necessary to prevent old mistakes from becoming ancient mistakes. We try to provide migration scripts, though, to make switching as easy as possible. See the v3 to v4 migration guide for details.
    • Versions with different minor numbers are backwards compatible, i. e. you can replace the lower version with the higher one and everything should still work. Features may be added, but not removed or changed, and the default behavior does not change.
    • Versions with different patchlevels are both forward- and backwards-compatible, because no new features have been introduced. The only difference between the two versions is one or more bugfixes, so you should generally install the higher version of the two.
  • I enabled the foo plugin using --enable-foo but now the build process fails. What's wrong?
    Since version 4.0.0, a server process doesn't need to load the plugins from which data should be received, in contrast to versions 3.*. This means that plugins with unmet dependencies no longer serve any purpose. So, starting with version 4.1.0, we moved dependency checking into the configure script: it now automatically disables all plugins with unmet dependencies and enables all plugins whose dependencies are met.
    So, if a plugin is displayed as disabled, its dependencies are not met. The normal way to get such a plugin compiled is to install the missing dependencies and re-run the configure script.
    You can force it to be built using --enable-foo, but you need to know exactly what you are doing. If you do this, you're out in the dark, cold woods and totally on your own!
  • The build process fails with "relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC". What's wrong?
    Many plugins have to be linked against libraries. A few of them (currently iptables, netlink and nut are known to be affected) link against libraries that many distributions ship only as "static libraries". Most distributions (e. g. Debian and SuSE GNU/Linux) do not compile static libraries with the "-fPIC" option, so these libraries cannot be linked into shared objects such as collectd's plugins, which are compiled with "-fPIC". Some architectures (among them i386) do not seem to care about that and handle it in some (probably magic) way. However, other architectures (mostly 64bit ones like amd64 or hppa) cannot handle it, and the linker aborts with the error message mentioned above.
    To fix this issue, you need a version of the static library compiled with "-fPIC" (or a shared library). Ask your distributor to provide a suitable version of the library or compile it yourself.
    For more detailed information please refer to:
  • Solaris support is broken! The build aborts! Help!
    There are two known issues with Solaris, but both can be fixed relatively easily:
    If you build a 32bit binary, the configure script will (try to) enable LFS (large file support). This results in an error that looks something like this:
    config.h:832:1: error: "_FILE_OFFSET_BITS" redefined
    Also, the swap-plugin has some problems of its own with this:
    swap.c:197: warning: implicit declaration of function 'swapctl'
    swap.c:197: error: 'SC_AINFO' undeclared (first use in this function)
    The solution is to build a 64bit binary! If you build a 64bit binary, LFS is not needed and the swap plugin works as intended. To do this, pass the -m64 flag to the compiler (assuming you're using the Sun C compiler).
    Another problem is that by default Sun defines a version of getgrnam_r that isn't POSIX-compatible. To enable POSIX compatibility, pass the _POSIX_PTHREAD_SEMANTICS define to the compiler.
    Putting it all together, you need to pass the following flags to the configure script:
    # Sun CC
    $ ./configure CFLAGS="-m64 -mt -D_POSIX_PTHREAD_SEMANTICS"
    Please note that we only test the Sun C compiler ourselves, but GCC may work, too. When using GCC, replace the -mt flag with the -pthreads flag. So if you use GCC, the above invocation of ./configure becomes:
    # GCC
    $ ./configure CFLAGS="-m64 -pthreads -D_POSIX_PTHREAD_SEMANTICS"
    Thanks to Christophe Kalt for sharing his insights :)
  • Why is the CPU usage split up in so many files? Can I change that?
    The short answer is: That is because otherwise backwards compatibility would be impossible and you would have to re-create your files from scratch regularly. And, "no".
    The long answer, and the explanation of the short answer: collectd runs on a variety of operating systems. Each operating system has its own method for accounting CPU states, memory consumption, swap usage, and so on. If all these data sources were in one data set, every newly supported operating system, or any addition to an already supported one, would mean that we need to modify the data set. This cannot be done without breaking backwards compatibility.
    To give you a few examples: sometime during the 2.6 series, the Linux kernel added some Xen patches which provided a new CPU state, "steal time". When adding support for BSD systems, we had to add "wired" memory. NFSv4 added some new procedures that NFSv3 didn't have, and so on.
    Interface traffic having two data sources is a different case, because every operating system accounts for received and transmitted bytes. Likewise for the system load: the 1, 5, and 15 minute averages have been like that for ages, and it's very unlikely that any weird UNIX does this differently.
    Changing the layout of the data is not just a matter of changing the types.db file. That file describes the layout of the data submitted by plugins. The plugins don't need it, since they know what data they submit. It's needed by the daemon and by the writing plugins to know how to store the data. If you mess with the file without knowing what you are doing, you will most likely end up with the data not being collected at all anymore.
  • Why doesn't collection.cgi draw foo graphs correctly?
    That script is meant as a starting point for your own development, not as a ready-to-use web frontend for RRD files written by collectd.
    It is just an example, because it's not really usable as it is. And it's not really usable because we are UNIX developers and don't enjoy doing web stuff much. Working on the daemon is just so much more fun... ;) So, in the best of free / open source traditions: patches welcome!
    There are alternatives, though. We've heard from various people using Cacti to render the graphs. Sergiusz Pawlowicz of the BBC has written CollectGraph, a macro for the MoinMoin wiki. And of course there's drraw.
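The compatibility rules from the answer to "What do the version numbers mean?" can be written down as a small C sketch. The type and function names below (collectd_version_t, parse_version, can_replace) are hypothetical and not part of collectd. The code only encodes the policy stated in that answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical representation of a "major.minor.patchlevel" version. */
typedef struct {
    int major;
    int minor;
    int patchlevel;
} collectd_version_t;

/* Parses strings such as "4.1.0"; returns false if the string is malformed. */
bool parse_version(const char *str, collectd_version_t *v) {
    char trailing;
    assert(str != NULL && v != NULL);
    /* Exactly three numbers; any trailing character makes sscanf return 4. */
    return sscanf(str, "%d.%d.%d%c", &v->major, &v->minor,
                  &v->patchlevel, &trailing) == 3;
}

/* Returns true if `installed` can be replaced by `candidate` without
 * further action, following the rules described in the FAQ answer. */
bool can_replace(collectd_version_t installed, collectd_version_t candidate) {
    if (installed.major != candidate.major)
        return false; /* different major number: basically not compatible */
    if (installed.minor != candidate.minor)
        return candidate.minor > installed.minor; /* only upgrading is safe */
    return true; /* patchlevel differences: compatible in both directions */
}
```

For example, can_replace() returns true when moving from 4.1.0 to 4.2.3, or between 4.1.0 and 4.1.5 in either direction. It returns false for 3.9.0 to 4.1.0, where the migration guide applies instead.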

Inside the rrdtool plugin

The rrdtool plugin is one of collectd's most complex plugins. The reason is that it has been tuned to work well in big setups, where updating RRD-files causes serious IO problems. A detailed description of the problem can be found in the Tuning RRDTool for performance article in the RRDTool wiki.

What is IO-hell?

As noted above, updating RRD-files is IO intensive, because very little data is written with every update and the locations that are accessed are not sequential. As long as all RRD-files (or at least the relevant parts of all RRD-files) fit into memory, the operating system's cache does a good job and IO is usually not a problem. If you have more files than that, you'll run into problems, because HDDs aren't good at random access and, from what we hear, NAND-flash based SSDs aren't there yet, either.

All these tiny, non-sequential IO-operations are sometimes referred to as "IO-hell". Due to the way hard disks work, they have a very hard time with such an access pattern and their transfer rate will drop to maybe 2 MByte/s - if you have good hardware.

Of course, one could increase the interval in which data is collected, but nobody wants that.

Surviving IO-hell

Instead of increasing the interval in which data is collected, we increase the interval in which data is written to disk.

Updating one value in an RRD file writes 8 bytes of data to the file. (We'll neglect the data that's written to the head of the RRD file, because there's nothing we can do about that.) To change these 8 bytes in the file, one block (512 bytes) has to be read from disk, updated, and written to disk again. Assuming that the machine doesn't have enough memory to hold all the files, updating this file 60 times requires 60 read and write operations.

Updating 60 values at once writes 480 bytes to the file. For this, one or two blocks need to be read from disk and written back, resulting in at most two read and two write operations. And since the second block directly follows the first one, reading and writing it hardly counts: it's a sequential access and therefore very fast.
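The block arithmetic above can be checked with a short C sketch. It assumes 512-byte blocks, 8-byte values, and values stored contiguously in the file. An archive that wraps around its end would touch an extra, non-adjacent block. The sketch also ignores the header writes mentioned above. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 512 /* block size assumed in the text */
#define VALUE_SIZE 8   /* bytes written per value */

/* Number of disk blocks that must be read and written back to update
 * `nbytes` contiguous bytes starting at byte offset `offset`. */
size_t blocks_touched(size_t offset, size_t nbytes) {
    assert(nbytes > 0);
    size_t first = offset / BLOCK_SIZE;
    size_t last = (offset + nbytes - 1) / BLOCK_SIZE;
    return last - first + 1;
}

/* Total block operations needed to write `count` consecutive values,
 * starting at `offset`, when they are written in batches of `batch`. */
size_t block_operations(size_t offset, size_t count, size_t batch) {
    size_t total = 0;
    assert(batch > 0);
    for (size_t i = 0; i < count; i += batch) {
        size_t n = (count - i < batch) ? count - i : batch;
        total += blocks_touched(offset + i * VALUE_SIZE, n * VALUE_SIZE);
    }
    return total;
}
```

block_operations(0, 60, 1) yields 60, one block per single-value update. block_operations(0, 60, 60) yields 1, and block_operations(256, 60, 60) yields 2, which matches the "one or two blocks" case above.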

Schematic overview of the rrdtool plugin

To cache the values in memory, the plugin uses a self-balancing binary search tree. Each node corresponds to an RRD file and holds the values that have not yet been written to the file. When a new value is added to a node, the plugin compares the timestamp of the oldest value with the timestamp of the value currently being inserted. If the difference (the "age" of the node) is too high, i. e. exceeds the configured timeout, the node is put in the "update queue".

If a value is received for a node that's already in the update queue, the node in the queue is updated, so that the pending write operation will include this new value as well. It's not implemented exactly like this, but think of a node as being either in the cache or in the queue, and never in both at the same time.
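Here is a minimal C sketch of the per-node bookkeeping described above. It assumes a fixed capacity per node and a timeout in seconds. It models a single node only, leaving out the search tree that maps RRD files to nodes. The type and function names are hypothetical and are not collectd's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define MAX_PENDING 64 /* hypothetical fixed capacity for this sketch */

/* A node is either waiting in the cache or sitting in the update queue. */
typedef enum { NODE_CACHED, NODE_QUEUED } node_state_t;

typedef struct {
    time_t times[MAX_PENDING];   /* timestamps of values not yet written */
    double values[MAX_PENDING];  /* the pending values themselves */
    size_t count;
    node_state_t state;
} cache_node_t;

/* Adds a value; returns true if the node must be put in the update queue now. */
bool cache_node_add(cache_node_t *node, time_t t, double value, time_t timeout) {
    assert(node->count < MAX_PENDING);
    node->times[node->count] = t;
    node->values[node->count] = value;
    node->count++;
    if (node->state == NODE_QUEUED)
        return false; /* already queued: the pending write includes this value */
    if (t - node->times[0] > timeout) { /* age = newest minus oldest timestamp */
        node->state = NODE_QUEUED;
        return true;
    }
    return false;
}

/* Called after the queue thread has written all pending values to disk. */
void cache_node_written(cache_node_t *node) {
    node->count = 0;
    node->state = NODE_CACHED;
}
```

With a timeout of 300 seconds, values at t = 0 and t = 300 stay in the cache. The value at t = 310 moves the node to the queue, and a value at t = 320 is simply added to the already queued node.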

A separate "queue thread" dequeues values from the update queue and writes them to the appropriate RRD file.

If values are enqueued to the update queue at a higher rate than the queue thread can dequeue them and write them to RRD files, new values go into nodes that are already enqueued, and multiple values are combined into one update of the RRD file. So even if your hardware can't keep up with the amount of data you want to write to disk, collectd can and will act as a dynamically growing buffer between your values and the RRD files on disk.
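Combining several pending values into one write works because a single rrdtool update can carry several "timestamp:value" arguments. The helper below is a hypothetical sketch that formats a node's pending values into such a space-separated argument string. It assumes one data source per RRD file and is not taken from collectd's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Writes "t1:v1 t2:v2 ..." into buf. Returns the resulting string length,
 * or -1 if buf is too small to hold all pending values. */
int format_update_args(char *buf, size_t len, const time_t *times,
                       const double *values, size_t count) {
    size_t used = 0;
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        /* Separate entries with a single space, none before the first. */
        int n = snprintf(buf + used, len - used, "%s%ld:%g",
                         (i == 0) ? "" : " ", (long)times[i], values[i]);
        if (n < 0 || (size_t)n >= len - used)
            return -1; /* output would have been truncated */
        used += (size_t)n;
    }
    return (int)used;
}
```

The queue thread would call such a helper once per dequeued node. It does not matter whether the node holds one value or hundreds, because each case results in a single file update.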

Escaping IO-hell

With the setup described so far, your system can handle almost arbitrarily large volumes of data, but the queue thread will run constantly and, believe me, it's very, very good at what it's doing. Your system will be so busy writing to RRD files that you won't be able to use it for anything else. Generating graphs from the RRD files on such a system is no fun.

And even if the queue thread is not running constantly, for example because you have set the timeout to a high value, all values tend to reach the right "age" at the same time. Imagine that all updates were evenly distributed over the five minutes after which you write the values to disk. In the morning the backup runs and IO becomes a lot slower. The update queue grows, and when the backup is done, most values reach the timeout age at the same time.

A solution to this problem is to throttle the speed at which RRD files are written. This isn't exactly what the rrdtool plugin does (it actually limits the rate at which nodes are dequeued from the update queue), but it basically has the same effect. So if, for example, your hardware can handle 100 updates per second (this number is not unrealistically low!), throttle the plugin to 50 updates per second. It will take a little longer for all the values to be written to disk (assuming the timeout is set to a high value), but your system will remain usable.
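The throttling idea can be sketched in C as a simple rate limiter that allows at most a fixed number of dequeue operations per second. The names are hypothetical. The current time is passed in as seconds (for example, read from a monotonic clock) so the logic can be tested without sleeping. This is not the rrdtool plugin's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double interval;     /* minimum seconds between two dequeues */
    double next_allowed; /* earliest time at which the next dequeue may happen */
} throttle_t;

void throttle_init(throttle_t *t, double max_per_second) {
    assert(t != NULL && max_per_second > 0.0);
    t->interval = 1.0 / max_per_second;
    t->next_allowed = 0.0;
}

/* Returns true if a node may be dequeued at time `now` and, if so,
 * pushes the next permitted dequeue one interval into the future. */
bool throttle_try(throttle_t *t, double now) {
    if (now < t->next_allowed)
        return false;
    t->next_allowed = now + t->interval;
    return true;
}
```

With a limit of 50 per second, the interval is 20 ms. A queue thread using this would check throttle_try() before each dequeue from the update queue and sleep briefly whenever it returns false.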

Master IO-hell

If in the previous paragraph you've thought "What, it will take even longer until I can see the data?!?", you've spotted the problem with the solution so far: what good is a very high resolution if it takes an hour for the data to actually show up in graphs? None at all, of course, and this is where the last concept comes into play: flushing.

The idea behind "flushing" is that values are written far more often than anyone actually looks at the graphs generated from them. Why should the daemon write to the file system every ten seconds (8640 times a day!) if the graph is only looked at twice a day? Wouldn't it make much more sense to write to the file system only when needed? This is what we mean by "flushing".

The rrdtool plugin can be told to write all values for one RRD file to disk right now (to "flush" the values). When a "FLUSH" request for a node is received, the node is put into the "flush queue". If the node was already in the update queue, it is removed from there and enqueued in the flush queue instead. The queue thread handles the flush queue with absolute priority, i. e. nodes are only dequeued from the update queue if the flush queue is empty. This is why dequeuing from the update queue can be limited: all files that were flushed are written to disk at the highest possible speed, ignoring the "speed limit" imposed on the update queue.
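The interplay of the two queues can be sketched in C with simple arrays of node IDs. This is a hypothetical illustration, not collectd's data structures. Whether the speed limit currently permits a dequeue from the update queue is passed in as a flag. Requests on the flush queue ignore that flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_CAPACITY 16 /* hypothetical fixed size for the sketch */

/* FIFO of node IDs; a real implementation would link the cache nodes. */
typedef struct {
    int ids[QUEUE_CAPACITY];
    size_t len;
} node_queue_t;

static bool queue_contains(const node_queue_t *q, int id) {
    for (size_t i = 0; i < q->len; i++)
        if (q->ids[i] == id)
            return true;
    return false;
}

static void queue_push(node_queue_t *q, int id) {
    assert(q->len < QUEUE_CAPACITY);
    q->ids[q->len++] = id;
}

/* Removes `id` from anywhere in the queue; returns true if it was present. */
static bool queue_remove(node_queue_t *q, int id) {
    for (size_t i = 0; i < q->len; i++) {
        if (q->ids[i] == id) {
            memmove(&q->ids[i], &q->ids[i + 1], (q->len - i - 1) * sizeof(int));
            q->len--;
            return true;
        }
    }
    return false;
}

/* Pops the oldest entry, or returns -1 if the queue is empty. */
static int queue_pop(node_queue_t *q) {
    int id;
    if (q->len == 0)
        return -1;
    id = q->ids[0];
    memmove(&q->ids[0], &q->ids[1], (q->len - 1) * sizeof(int));
    q->len--;
    return id;
}

/* A node whose age exceeded the timeout goes to the update queue,
 * unless it is already waiting in either queue. */
void enqueue_update(node_queue_t *update_q, const node_queue_t *flush_q, int id) {
    if (!queue_contains(update_q, id) && !queue_contains(flush_q, id))
        queue_push(update_q, id);
}

/* FLUSH request: move the node out of the update queue into the flush queue. */
void request_flush(node_queue_t *update_q, node_queue_t *flush_q, int id) {
    queue_remove(update_q, id);
    if (!queue_contains(flush_q, id))
        queue_push(flush_q, id);
}

/* Next node for the queue thread: the flush queue has absolute priority;
 * the update queue is only served when the speed limit allows it. */
int next_node_to_write(node_queue_t *update_q, node_queue_t *flush_q,
                       bool update_allowed) {
    if (flush_q->len > 0)
        return queue_pop(flush_q);
    if (update_allowed)
        return queue_pop(update_q);
    return -1; /* nothing may be written right now */
}
```

A flushed node is returned even while the speed limit blocks the update queue. This is the behavior described above.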

To send the "FLUSH" command to the rrdtool plugin, load the unixsock plugin, connect to the UNIX domain socket it opens, and send the command as described in collectd-unixsock(5). The sample graphing script in contrib/, collection3, can automatically send the FLUSH command before drawing a graph. If you need a pointer on how to send the command from your own graphing solution, take a look at that script.
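For a graphing frontend written in C, sending the command could look roughly like the sketch below. It assumes the FLUSH syntax documented in collectd-unixsock(5), where FLUSH is followed by optional plugin= and identifier= options, and that the daemon answers with a status line. The socket path and identifier in the usage note are examples only. The real path is whatever the unixsock plugin's SocketFile option is set to. The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Builds a FLUSH command for one plugin and, optionally, one identifier.
 * Returns the command length, or -1 if buf is too small. */
int build_flush_command(char *buf, size_t len, const char *plugin,
                        const char *identifier) {
    int n;
    assert(buf != NULL && plugin != NULL);
    if (identifier != NULL)
        n = snprintf(buf, len, "FLUSH plugin=%s identifier=\"%s\"\n",
                     plugin, identifier);
    else
        n = snprintf(buf, len, "FLUSH plugin=%s\n", plugin);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Sends cmd to the UNIX socket at path and stores the start of the reply
 * (the status line) in reply. Returns 0 on success, -1 on any error. */
int send_unixsock_command(const char *path, const char *cmd,
                          char *reply, size_t reply_len) {
    struct sockaddr_un addr;
    size_t off = 0, cmd_len = strlen(cmd);
    ssize_t n;
    int fd;

    assert(reply != NULL && reply_len > 0);
    if (strlen(path) >= sizeof(addr.sun_path))
        return -1; /* path does not fit into sockaddr_un */
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path); /* length checked above */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    while (off < cmd_len) { /* write() may send fewer bytes than asked */
        ssize_t w = write(fd, cmd + off, cmd_len - off);
        if (w < 0) {
            close(fd);
            return -1;
        }
        off += (size_t)w;
    }
    n = read(fd, reply, reply_len - 1);
    close(fd);
    if (n <= 0)
        return -1;
    reply[n] = '\0';
    return 0;
}
```

For example, you could call build_flush_command(cmd, sizeof cmd, "rrdtool", "localhost/cpu-0/cpu-idle") and then send_unixsock_command("/var/run/collectd-unixsock", cmd, reply, sizeof reply). This would ask the rrdtool plugin to write the pending values of that one RRD file before a graph is drawn.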