Minutes from 2020-07-06

From collectd Wiki
Revision as of 09:25, 7 July 2020 by Octo (talk | contribs)


Attendees: Slawomir Strehlau, Octo, Pawel Zak, Matthias Runge, Robert, Piotr Dresler, Kamil Waitrowski, Sunku

Collectd 6.0 & OpenMetrics work (by Octo)

  • The core data structures were changed in the collectd-6.0 branch, which caused a good amount of work.
    • The data structures are a C adaptation of the Prometheus protobuf. A "metric family" holds the metric name and the metric type (gauge, counter, etc.) and can contain many metrics. Each metric holds the actual value, the sampling time and interval (the interval being an addition to the Prometheus format), labels, and meta data. Implications:
    • The daemon now takes a metric family instead of a value list (plugin_dispatch_metric_family()).
    • It allows plugins like the CPU and memory plugins to send related metrics to the daemon in one call. If the timestamp is zero, they all receive the same timestamp, which simplifies aggregation / alignment.
      Currently, plugins can populate the time field of the value list they send; if it is unset, the dispatch function sets the timestamp. This happens for each value list individually, resulting in differences in the millisecond range between related metrics.
    • There are currently three differences between the implemented data structures and OpenMetrics / Prometheus:
      • OpenMetrics uses timestamps in milliseconds since the epoch; collectd currently uses the more accurate cdtime_t data type.
        Suggestion: store times in microseconds instead.
      • metric_t contains an "interval" field that is not present in Prometheus.
      • metric_t contains a "meta" field that is not present in Prometheus. Current thinking is to make this internal-only again, for example to mark metrics received via the network.
    • Functions that work today already:
      • Dispatching / routing of metrics to write plugins.
      • The metrics cache: converting counters to rates, storing meta data with metric identifiers.
    • The CPU, Write Stackdriver, and Write Log plugins have been ported.
    • Formatting in "Graphite", JSON, and Stackdriver works.
    • A few unit tests still need updating.
    • Code that looks up metrics with partial matches, e.g. thresholds and aggregation, is not written yet and needs a solid design.
  • Migrating plugins
    • 173 plugins potentially build in today's collectd. Some of them are barely used and could be removed, for example the Ascent and XMMS plugins.
    • Quite a few decisions need to be taken. For example, the CSV plugin is easy to migrate, but we need to figure out how to map metric labels to a filesystem path.
    • Each write plugin needs to be updated to accept the new metric_family_t.
    • Many read plugins potentially build thanks to the compatibility layer, but should be reviewed so that they make use of the new naming schema.
    • Plugins that allow users to map inputs to plugin instance or type instance need special care, e.g. PostgreSQL plugin.
    • Suggestion: Let collectd 6 ship with fewer plugins than collectd 5. Let community migrate the plugins they depend on and drop unused plugins.
    • Suggestion: Move infrequently used plugins into a separate repository, similar to the go-plugins and python-plugins repositories.
    • AI(octo): Put together a single document on design decisions. The advantage of one huge document over many small ones is that all the information is consolidated in one place.
  • Next steps
    • Not quite ready for everyone to jump in and port plugins, but not far off. It would be great if folks could earmark some time.
    • octo will clean up the git history; once it is in the collectd-6.0 branch, more people can migrate plugins. octo will then ask everyone to contribute, collaborating either in a Google Doc or on the wiki.
    • Migrating a few plugins to get a feel for the API and make improvements.
      • Currently migrating the Memory plugin to figure out an elegant way to create a metric family with many similar metrics.
  • Misc
    • Ensuring backward compatibility is subtle; expect changes.
    • A release at the end of the year might be realistic, depending on how many people contribute to the migration effort and whether or not we migrate all 173 plugins.
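
The metric family layout and the shared-timestamp behaviour described above can be sketched in C roughly as follows. This is a minimal sketch, not the code in the collectd-6.0 branch: every type and function name here (sketch_metric_family_t, sketch_fill_time(), sketch_time_to_ms(), ...) is hypothetical, and the time type assumes the 2^-30 second resolution used by cdtime_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cdtime_t: assumed to count 2^-30 second units. */
typedef uint64_t sketch_time_t;

typedef enum { SKETCH_GAUGE, SKETCH_COUNTER } sketch_type_t;

typedef struct {
  const char *name;
  const char *value;
} sketch_label_t;

/* One metric: value, sampling time and interval, plus labels. */
typedef struct {
  double value;
  sketch_time_t time;     /* 0 means "unset" */
  sketch_time_t interval; /* not part of the Prometheus format */
  const sketch_label_t *labels;
  size_t labels_num;
} sketch_metric_t;

/* A metric family: one name and type shared by many metrics. */
typedef struct {
  const char *name;
  sketch_type_t type;
  sketch_metric_t *metrics;
  size_t metrics_num;
} sketch_metric_family_t;

/* Give every metric with an unset time the same timestamp "now".
 * Returns the number of metrics that were stamped. */
static size_t sketch_fill_time(sketch_metric_family_t *fam, sketch_time_t now) {
  assert(fam != NULL);
  size_t stamped = 0;
  for (size_t i = 0; i < fam->metrics_num; i++) {
    if (fam->metrics[i].time == 0) {
      fam->metrics[i].time = now;
      stamped++;
    }
  }
  return stamped;
}

/* Convert 2^-30 second units to milliseconds since the epoch, rounding
 * the fractional part to the nearest millisecond. */
static uint64_t sketch_time_to_ms(sketch_time_t t) {
  uint64_t sec = t >> 30;
  uint64_t frac = t & ((UINT64_C(1) << 30) - 1);
  return sec * 1000 + ((frac * 1000 + (UINT64_C(1) << 29)) >> 30);
}
```

A plugin following this pattern would fill in several related metrics, leave their time at zero, and hand over the whole family in one call; the stamping step then gives all of them an identical timestamp. The conversion helper shows the precision that is lost when such a timestamp is reduced to OpenMetrics-style milliseconds.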

Porting effort for 6.0

  • The 4.x branch continued to be maintained for a relatively long time; maintenance eventually stopped (about 5 years).
  • However, bug fixes for 4.11 kept coming in for about 6 months after the 5.0 release, after which the branch was effectively dead.
  • So 6.0 will be released, but 5.x will still be supported, limited to major security releases.
  • For the 6.0 release, create a separate directory for less frequently used plugins (e.g. the teamspeak plugin). The CPU and memory plugins could stay in the core plugin list.
  • Can 6.0 be released without all plugins being ported?
    • We need to be open to people porting additional plugins to 6.0, but should only accept plugins where people are invested in porting them.
    • Florian will send a doc listing the 6.0 features that are in progress and those still to be done.

Go collectd changes for 6.0

  • The 6.0 changes will break the go-collectd framework. Different branches will be maintained; we will check whether the packages can be kept backward compatible.
  • The API needs to be stabilized. Key plugins are currently being ported in C for 6.0, and a stable API for go-collectd needs to follow. Once that is in place, no huge changes are expected, and a 6.0 branch could be created in go-collectd that passes metric families. The Go data structures, along with the plain text protocol, need to be updated.

Interns on distribution metrics

  • Three interns started today at Google (2nd or 4th semester of their bachelor's degrees). They will research a solution and write a design document on distribution metrics, which they will present in the next call. They are working on the 6.0 branch; distribution metrics will be a new feature in 6.0.
  • Example usage of distribution metrics: consider the latency of a web service, reported as one metric every 10 seconds, with each metric covering 2000 requests. The naïve approach is to calculate the average latency of all requests in those 10 seconds. If we want to use latency as a Service Level Indicator, we may instead want the 95th percentile of all requests. This is what distribution metrics allow us to do: they describe a distribution over a certain range of values.
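
The gap between the average and the 95th percentile in the example above can be illustrated with a small C sketch. It keeps every latency sample in memory and uses the nearest-rank method purely for illustration. It is not the interns' design, which is still to be written, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
  double x = *(const double *)a;
  double y = *(const double *)b;
  return (x > y) - (x < y);
}

/* Naive approach: average latency over all samples of one interval. */
static double latency_mean(const double *v, size_t n) {
  assert(v != NULL && n > 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += v[i];
  return sum / (double)n;
}

/* Nearest-rank percentile: the smallest sample such that at least pct
 * percent of all samples are less than or equal to it.
 * Sorts the array in place. */
static double latency_percentile(double *v, size_t n, unsigned pct) {
  assert(v != NULL && n > 0 && pct > 0 && pct <= 100);
  qsort(v, n, sizeof(*v), cmp_double);
  size_t rank = (n * pct + 99) / 100; /* ceil(n * pct / 100) */
  return v[rank - 1];
}
```

For 2000 requests of which 1800 take 10 ms and 200 take 500 ms, the mean is 59 ms while the 95th percentile is 500 ms, so the average hides the slow tail that a Service Level Indicator is meant to capture.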

Go-collectd features