plugin/forward using pkg/up (#1493)

* plugin/forward: on-demand health checking

Only start doing health checks when we encounter an error (any error).
This uses the new plugin/pkg/up package to abstract away the actual
checking. This reduces the LOC quite a bit; it does need more testing: unit
testing and a bit of tcpdumping.

* fix tests

* Fix readme

* Use pkg/up for healthchecks

* remove unused channel

* more cleanups

* update readme

* * Again do go generate and go build; still referencing the wrong forward
  repo? Anyway fixed.
* Use pkg/up for doing the health checks to cut back on unwanted queries
  * Change up.Func to return an error instead of a boolean.
  * Drop the string target argument as it doesn't make sense.
* Add healthcheck test on failing to get an upstream answer.

TODO(miek): double check Forward and Lookup and how they interact with
HC, and if we correctly call close() on those

* actual test

* Tests here

* more tests

* try getting rid of host

* Get rid of the host indirection

* Finish removing hosts

* moar testing

* import fmt

* field is not used

* docs

* move some stuff

* bring back health_check

* maxfails=0 test

* git and merging, bah

* review
Miek Gieben
2018-02-15 10:21:57 +01:00
committed by GitHub
parent 8b035fa938
commit 16504234e5
15 changed files with 306 additions and 221 deletions


@@ -6,10 +6,17 @@
## Description
The *forward* plugin is generally faster (~30+%) than *proxy* as it re-uses already opened sockets
to the upstreams. It supports UDP, TCP and DNS-over-TLS and uses inband health checking that is
enabled by default.
When *all* upstreams are down it assumes healtchecking as a mechanism has failed and will try to
The *forward* plugin re-uses already opened sockets to the upstreams. It supports UDP, TCP and
DNS-over-TLS and uses in band health checking.
When it detects an error a health check is performed. This check runs in a loop, every *0.5s*, for
as long as the upstream reports unhealthy. Once healthy we stop health checking (until the next
error). The health checks use a recursive DNS query (`. IN NS`) to get upstream health. Any response
that is not a network error (REFUSED, NOTIMPL, SERVFAIL, etc) is taken as a healthy upstream. The
health check uses the same protocol as specified in **TO**. If `max_fails` is set to 0, no checking
is performed and upstreams will always be considered healthy.
When *all* upstreams are down it assumes health checking as a mechanism has failed and will try to
connect to a random upstream (which may or may not work).
## Syntax
@@ -22,16 +29,11 @@ forward FROM TO...
* **FROM** is the base domain to match for the request to be forwarded.
* **TO...** are the destination endpoints to forward to. The **TO** syntax allows you to specify
a protocol, `tls://9.9.9.9` or `dns://` for plain DNS. The number of upstreams is limited to 15.
a protocol, `tls://9.9.9.9` or `dns://` (or no protocol) for plain DNS. The number of upstreams is
limited to 15.
The health checks are done every *0.5s*. After *two* failed checks the upstream is considered
unhealthy. The health checks use a recursive DNS query (`. IN NS`) to get upstream health. Any
response that is not an error (REFUSED, NOTIMPL, SERVFAIL, etc) is taken as a healthy upstream. The
health check uses the same protocol as specified in **TO**. On startup each upstream is marked
unhealthy until it passes a health check. A 0 duration will disable any health checks.
Multiple upstreams are randomized (default policy) on first use. When a healthy proxy returns an
error during the exchange the next upstream in the list is tried.
Multiple upstreams are randomized (see `policy`) on first use. When a healthy proxy returns an error
during the exchange the next upstream in the list is tried.
Extra knobs are available with an expanded syntax:
@@ -39,12 +41,12 @@ Extra knobs are available with an expanded syntax:
forward FROM TO... {
except IGNORED_NAMES...
force_tcp
health_check DURATION
expire DURATION
max_fails INTEGER
tls CERT KEY CA
tls_servername NAME
policy random|round_robin
health_checks DURATION
}
~~~
@@ -52,21 +54,16 @@ forward FROM TO... {
* **IGNORED_NAMES** in `except` is a space-separated list of domains to exclude from forwarding.
Requests that match none of these names will be passed through.
* `force_tcp`, use TCP even when the request comes in over UDP.
* `health_checks`, use a different **DURATION** for health checking, the default duration is 0.5s.
A value of 0 disables the health checks completely.
* `max_fails` is the number of subsequent failed health checks that are needed before considering
a backend to be down. If 0, the backend will never be marked as down. Default is 2.
an upstream to be down. If 0, the upstream will never be marked as down (nor health checked).
Default is 2.
* `expire` **DURATION**, expire (cached) connections after this time, the default is 10s.
* `tls` **CERT** **KEY** **CA** define the TLS properties for TLS; if you leave this out the
system's configuration will be used.
* `tls_servername` **NAME** allows you to set a server name in the TLS configuration; for instance 9.9.9.9
needs this to be set to `dns.quad9.net`.
* `policy` specifies the policy to use for selecting upstream servers. The default is `random`.
The upstream selection is done via random (default policy) selection. If the socket for this client
isn't known *forward* will randomly choose one. If this turns out to be unhealthy, the next one is
tried. If *all* hosts are down, we assume health checking is broken and select a *random* upstream to
try.
* `health_checks`, use a different **DURATION** for health checking, the default duration is 0.5s.
Also note the TLS config is "global" for the whole forwarding proxy if you need a different
`tls-name` for different upstreams you're out of luck.
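
How `max_fails` and the "all upstreams down" fallback interact can be illustrated with a small Go sketch. It is written under stated assumptions and is not the plugin's implementation: the `upstream` type, `down` and `pick` are hypothetical names, the comparison `fails >= maxFails` is one reading of "number of failed health checks needed", and the list is assumed to be already ordered by the configured `policy`.

```go
package main

import (
	"math/rand"
	"sync/atomic"
)

// upstream is a hypothetical stand-in for one TO endpoint.
type upstream struct {
	addr  string
	fails uint32 // consecutive failed health checks
}

// down reports whether the upstream should be treated as unhealthy.
// With maxFails == 0 an upstream is never marked as down.
func (u *upstream) down(maxFails uint32) bool {
	if maxFails == 0 {
		return false
	}
	return atomic.LoadUint32(&u.fails) >= maxFails
}

// pick walks the list (assumed to be ordered by the policy) and returns the
// first upstream that is not down. If every upstream is down, health checking
// itself is assumed to be broken and a random upstream is returned instead.
func pick(list []*upstream, maxFails uint32) *upstream {
	if len(list) == 0 {
		return nil
	}
	for _, u := range list {
		if !u.down(maxFails) {
			return u
		}
	}
	return list[rand.Intn(len(list))]
}
```

Keeping the `maxFails == 0` test inside `down` means callers such as `pick` need no special case for disabled health checking.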
@@ -80,7 +77,7 @@ If monitoring is enabled (via the *prometheus* directive) then the following met
* `coredns_forward_response_rcode_total{to, rcode}` - count of RCODEs per upstream.
* `coredns_forward_healthcheck_failure_count_total{to}` - number of failed health checks per upstream.
* `coredns_forward_healthcheck_broken_count_total{}` - counter of when all upstreams are unhealthy,
and we are randomly spraying to a target.
and we are randomly (this always uses the `random` policy) spraying to an upstream.
* `coredns_forward_socket_count_total{to}` - number of cached sockets per upstream.
Where `to` is one of the upstream servers (**TO** from the config), `proto` is the protocol used by
@@ -125,16 +122,10 @@ Proxy everything except `example.org` using the host's `resolv.conf`'s nameserve
}
~~~
Forward to an IPv6 host:
~~~ corefile
. {
forward . [::1]:1053
}
~~~
Proxy all requests to 9.9.9.9 using the DNS-over-TLS protocol, and cache every answer for up to 30
seconds.
seconds. Note the `tls_servername` is mandatory if you want a working setup, as 9.9.9.9 can't be
used in the TLS negotiation. Also set the health check duration to 5s to not completely swamp the
service with health checks.
~~~ corefile
. {
@@ -148,7 +139,7 @@ seconds.
## Bugs
The TLS config is global for the whole forwarding proxy if you need a different `tls-name` for
The TLS config is global for the whole forwarding proxy; if you need a different `tls_servername` for
different upstreams you're out of luck.
## Also See