plugin/forward using pkg/up (#1493)

* plugin/forward: on demand healthchecking

  Only start doing health checks when we encounter an error (any error).
  This uses the new plugin/pkg/up package to abstract away the actual
  checking. This reduces the LOC quite a bit; does need more testing, unit
  testing and tcpdumping a bit.

* fix tests
* Fix readme
* Use pkg/up for healthchecks
* remove unused channel
* more cleanups
* update readme
* Again do go generate and go build; still referencing the wrong forward
  repo? Anyway fixed.
* Use pkg/up for doing the healthchecks to cut back on unwanted queries
  * Change up.Func to return an error instead of a boolean.
  * Drop the string target argument as it doesn't make sense.
  * Add healthcheck test on failing to get an upstream answer.

  TODO(miek): double check Forward and Lookup and how they interact with HC,
  and if we correctly call close() on those
* actual test
* Tests here
* more tests
* try getting rid of host
* Get rid of the host indirection
* Finish removing hosts
* moar testing
* import fmt
* field is not used
* docs
* move some stuff
* bring back health_check
* maxfails=0 test
* git and merging, bah
* review
2025-12-21 17:45:15 -05:00 · 2018-02-15 10:21:57 +01:00
parent 8b035fa938
commit 16504234e5
15 changed files with 306 additions and 221 deletions
--- a/plugin/forward/README.md
+++ b/plugin/forward/README.md
@@ -6,10 +6,17 @@

 ## Description

-The *forward* plugin is generally faster (~30+%) than *proxy* as it re-uses already opened sockets
-to the upstreams. It supports UDP, TCP and DNS-over-TLS and uses inband health checking that is
-enabled by default.
-When *all* upstreams are down it assumes healtchecking as a mechanism has failed and will try to
+The *forward* plugin re-uses already opened sockets to the upstreams. It supports UDP, TCP and
+DNS-over-TLS and uses in band health checking.
+
+When it detects an error a health check is performed. This check runs in a loop, every *0.5s*, for
+as long as the upstream reports unhealthy. Once healthy we stop health checking (until the next
+error). The health checks use a recursive DNS query (`. IN NS`) to get upstream health. Any response
+that is not a network error (REFUSED, NOTIMPL, SERVFAIL, etc) is taken as a healthy upstream. The
+health check uses the same protocol as specified in **TO**. If `max_fails` is set to 0, no checking
+is performed and upstreams will always be considered healthy.
+
+When *all* upstreams are down it assumes health checking as a mechanism has failed and will try to
 connect to a random upstream (which may or may not work).

 ## Syntax
@@ -22,16 +29,11 @@ forward FROM TO...

 * **FROM** is the base domain to match for the request to be forwarded.
 * **TO...** are the destination endpoints to forward to. The **TO** syntax allows you to specify
-  a protocol, `tls://9.9.9.9` or `dns://` for plain DNS. The number of upstreams is limited to 15.
+  a protocol, `tls://9.9.9.9` or `dns://` (or no protocol) for plain DNS. The number of upstreams is
+  limited to 15.

-The health checks are done every *0.5s*. After *two* failed checks the upstream is considered
-unhealthy. The health checks use a recursive DNS query (`. IN NS`) to get upstream health. Any
-response that is not an error (REFUSED, NOTIMPL, SERVFAIL, etc) is taken as a healthy upstream. The
-health check uses the same protocol as specific in the **TO**. On startup each upstream is marked
-unhealthy until it passes a health check. A 0 duration will disable any health checks.
-
-Multiple upstreams are randomized (default policy) on first use. When a healthy proxy returns an
-error during the exchange the next upstream in the list is tried.
+Multiple upstreams are randomized (see `policy`) on first use. When a healthy proxy returns an error
+during the exchange the next upstream in the list is tried.

 Extra knobs are available with an expanded syntax:

@@ -39,12 +41,12 @@ Extra knobs are available with an expanded syntax:
 forward FROM TO... {
    except IGNORED_NAMES...
    force_tcp
-    health_check DURATION
    expire DURATION
    max_fails INTEGER
    tls CERT KEY CA
    tls_servername NAME
    policy random|round_robin
+    health_checks DURATION
 }
 ~~~

@@ -52,21 +54,16 @@ forward FROM TO... {
 * **IGNORED_NAMES** in `except` is a space-separated list of domains to exclude from forwarding.
  Requests that match none of these names will be passed through.
 * `force_tcp`, use TCP even when the request comes in over UDP.
-* `health_checks`, use a different **DURATION** for health checking, the default duration is 0.5s.
-  A value of 0 disables the health checks completely.
 * `max_fails` is the number of subsequent failed health checks that are needed before considering
-  a backend to be down. If 0, the backend will never be marked as down. Default is 2.
+  an upstream to be down. If 0, the upstream will never be marked as down (nor health checked).
+  Default is 2.
 * `expire` **DURATION**, expire (cached) connections after this time, the default is 10s.
 * `tls` **CERT** **KEY** **CA** define the TLS properties for TLS; if you leave this out the
  system's configuration will be used.
 * `tls_servername` **NAME** allows you to set a server name in the TLS configuration; for instance 9.9.9.9
  needs this to be set to `dns.quad9.net`.
 * `policy` specifies the policy to use for selecting upstream servers. The default is `random`.
-
-The upstream selection is done via random (default policy) selection. If the socket for this client
-isn't known *forward* will randomly choose one. If this turns out to be unhealthy, the next one is
-tried. If *all* hosts are down, we assume health checking is broken and select a *random* upstream to
-try.
+* `health_checks`, use a different **DURATION** for health checking, the default duration is 0.5s.

 Also note the TLS config is "global" for the whole forwarding proxy if you need a different
 `tls-name` for different upstreams you're out of luck.
@@ -80,7 +77,7 @@ If monitoring is enabled (via the *prometheus* directive) then the following met
 * `coredns_forward_response_rcode_total{to, rcode}` - count of RCODEs per upstream.
 * `coredns_forward_healthcheck_failure_count_total{to}` - number of failed health checks per upstream.
 * `coredns_forward_healthcheck_broken_count_total{}` - counter of when all upstreams are unhealthy,
-  and we are randomly spraying to a target.
+  and we are randomly (this always uses the `random` policy) spraying to an upstream.
 * `coredns_forward_socket_count_total{to}` - number of cached sockets per upstream.

 Where `to` is one of the upstream servers (**TO** from the config), `proto` is the protocol used by
@@ -125,16 +122,10 @@ Proxy everything except `example.org` using the host's `resolv.conf`'s nameserve
 }
 ~~~

-Forward to a IPv6 host:
-
-~~~ corefile
-. {
-    forward . [::1]:1053
-}
-~~~
-
 Proxy all requests to 9.9.9.9 using the DNS-over-TLS protocol, and cache every answer for up to 30
-seconds.
+seconds. Note the `tls_servername` is mandatory if you want a working setup, as 9.9.9.9 can't be
+used in the TLS negotiation. Also set the health check duration to 5s to not completely swamp the
+service with health checks.

 ~~~ corefile
 . {
@@ -148,7 +139,7 @@ seconds.

 ## Bugs

-The TLS config is global for the whole forwarding proxy if you need a different `tls-name` for
+The TLS config is global for the whole forwarding proxy if you need a different `tls_servername` for
 different upstreams you're out of luck.

 ## Also See