# health

## Name

*health* - enables a health check endpoint.

## Description

By enabling *health*, any plugin that implements the
[health.Healther interface](https://godoc.org/github.com/coredns/coredns/plugin/health#Healther)
will be queried for its health. The combined health is exported, by default, on port 8080 at `/health`.

## Syntax

~~~
health [ADDRESS]
~~~

Optionally takes an address; the default is `:8080`. The health path is fixed to `/health`. The
health endpoint returns a 200 response code and the word "OK" when this server is healthy. It returns
a 503 otherwise. *health* periodically (every 1s) polls plugins that export health information. If any
of the plugins signals that it is unhealthy, the server will go unhealthy too. Each plugin that supports
health checks has a section "Health" in its README.
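
The polling and response behavior described above can be sketched in Go. This is a minimal illustration, not the plugin's actual code: the `health` type, its `checks` field, and the `status` helper are hypothetical names, and each check function stands in for one plugin reporting its health. The caller invokes `poll` directly here, where the plugin would run it once a second.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// health holds the combined state. The checks stand in for plugins that
// export health information (hypothetical, for illustration only).
type health struct {
	mu     sync.RWMutex
	ok     bool
	checks []func() bool
}

// poll queries every check once; a single false makes the whole state unhealthy.
func (h *health) poll() {
	ok := true
	for _, check := range h.checks {
		if !check() {
			ok = false
			break
		}
	}
	h.mu.Lock()
	h.ok = ok
	h.mu.Unlock()
}

// ServeHTTP answers 200 with the word "OK" when healthy and 503 otherwise.
func (h *health) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.RLock()
	ok := h.ok
	h.mu.RUnlock()
	if !ok {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	fmt.Fprint(w, "OK")
}

// status polls once and returns the code and body a client would see.
func status(checks ...func() bool) (int, string) {
	h := &health{checks: checks}
	h.poll()
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	up := func() bool { return true }
	down := func() bool { return false }

	code, body := status(up, up)
	fmt.Printf("all healthy: %d %q\n", code, body)

	code, body = status(up, down)
	fmt.Printf("one unhealthy: %d %q\n", code, body)
}
```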

More options can be set with this extended syntax:

~~~
health [ADDRESS] {
    lameduck DURATION
}
~~~

* Where `lameduck` will make the process unhealthy, then *wait* for **DURATION** before the process
  shuts down.

If you have multiple Server Blocks and need to export health for each of the plugins, you must run
health endpoints on different ports:

~~~ corefile
com {
    whoami
    health :8080
}

net {
    erratic
    health :8081
}
~~~

Note that if you combine these into one server block you will get an error on startup, because the
second server can't set up the health plugin (on the same port).

~~~ txt
com net {
    whoami
    erratic
    health :8080
}
~~~
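
The startup error comes down to two listeners asking for the same address. The Go sketch below reproduces that situation on its own, assuming ordinary TCP listener semantics; `bindTwice` is a hypothetical helper and is unrelated to the plugin's setup code.

```go
package main

import (
	"fmt"
	"net"
)

// bindTwice listens on addr, then tries to listen a second time on the
// address that was actually bound. It returns the second attempt's error.
func bindTwice(addr string) error {
	first, err := net.Listen("tcp", addr)
	if err != nil {
		return fmt.Errorf("first listen: %w", err)
	}
	defer first.Close()

	second, err := net.Listen("tcp", first.Addr().String())
	if err != nil {
		return err // typically "address already in use"
	}
	second.Close()
	return nil
}

func main() {
	// Port 0 lets the OS pick a free port for the first listener.
	err := bindTwice("127.0.0.1:0")
	fmt.Println("second bind failed:", err != nil)
}
```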

## Plugins

Any plugin that implements the Healther interface will be used to report health.
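
As a rough sketch of what implementing the interface could look like: the `Healther` interface is redeclared locally so the example is self-contained (its single `Health() bool` method is an assumption based on the linked documentation), and `cache` is a hypothetical plugin that only turns healthy once its data has been loaded.

```go
package main

import (
	"fmt"
	"sync"
)

// Healther is redeclared here for illustration; the real interface lives
// in the health plugin's package.
type Healther interface {
	Health() bool
}

// cache is a hypothetical plugin that is unhealthy until Load has run.
type cache struct {
	mu     sync.RWMutex
	loaded bool
}

// Load simulates the initial data load completing.
func (c *cache) Load() {
	c.mu.Lock()
	c.loaded = true
	c.mu.Unlock()
}

// Health reports false until the initial load has finished.
func (c *cache) Health() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.loaded
}

// Compile-time check that *cache satisfies Healther.
var _ Healther = (*cache)(nil)

func main() {
	c := &cache{}
	fmt.Println("before load:", c.Health())
	c.Load()
	fmt.Println("after load:", c.Health())
}
```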

## Metrics

If monitoring is enabled (via the *prometheus* directive) then the following metric is exported:

* `coredns_health_request_duration_seconds{}` - duration to process a /health query. As this should
  be a local operation it should be fast. A (large) increase in this duration indicates the
  CoreDNS process is having trouble keeping up with its query load.

Note that this metric *does not* have a `server` label, because being overloaded is a symptom of
the running process, *not* a specific server.

## Examples

Run another health endpoint on http://localhost:8091.

~~~ corefile
. {
    health localhost:8091
}
~~~

Set a lameduck duration of 1 second:

~~~ corefile
. {
    health localhost:8092 {
        lameduck 1s
    }
}
~~~