After several experiments at SoundCloud we found that the current
minimum read timeout of 10ms is too low. A single request against a
slow/unavailable authoritative server can cause all TCP connections to
get closed. We record a 50th percentile forward/proxy latency of <5ms,
and a 99th percentile latency of 60ms. Using a minimum timeout of 200ms
seems to be a fair trade-off between avoiding unnecessarily high
connection churn and reacting to upstream failures in a timely manner.
This change also renames hcDuration to hcInterval to reflect its usage,
and removes the duplicated timeout constant to make code comprehension
easier.
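A minimal sketch of the clamp this implies; the constant and function
names are illustrative, not the plugin's actual identifiers:

    import "time"

    // minTimeout is the floor for the computed read timeout: 10ms proved
    // too aggressive, 200ms trades a little reaction speed for far less
    // connection churn.
    const minTimeout = 200 * time.Millisecond

    func limitTimeout(t time.Duration) time.Duration {
        if t < minTimeout {
            return minTimeout
        }
        return t
    }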
Add a test to see if we copy the rcode correctly. Some minor cleanup in
import ordering and renaming NewUpstream to New as we already are in the
upstream package.
* plugin/forward: erase expired connection by timer
- in the previous implementation, expired connections stayed in the
cache until a new request for the same upstream/protocol came. If
the upstream was unhealthy, a new request might come a long time
later, or not at all. All that time the expired connections held
system resources (file descriptors, ephemeral ports). In this fix
the expired connections and their resources are released by a timer
- decreased the complexity of taking a connection from the cache. The
list of connections is treated as a stack (LIFO queue), i.e. a
connection is taken from the end of the queue (the freshest one) and
returned to the end (as was implemented before). The remarkable
thing is that all connections in the stack end up ordered by the
'used' field
- the cleanup() method finds the first good (not expired) connection
in the stack with a binary search, since all connections are ordered
by the 'used' field (see the sketch after this list)
* fix race conditions
* minor enhancement
* add comments
- the connManager() goroutine stops when the Proxy is about to be
garbage collected. This means that no queries are in progress and
no more queries are going to come
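A minimal sketch of that binary-search cleanup, assuming a slice ordered
oldest-first by its 'used' timestamp; the names here are illustrative,
not the plugin's actual code:

    import (
        "net"
        "sort"
        "time"
    )

    type conn struct {
        c    net.Conn
        used time.Time
    }

    // cleanup closes every connection that has been idle longer than
    // expire and returns the survivors. Because conns is ordered by
    // 'used' (oldest first), sort.Search finds the first still-fresh
    // connection in O(log n).
    func cleanup(conns []*conn, expire time.Duration) []*conn {
        limit := time.Now().Add(-expire)
        i := sort.Search(len(conns), func(i int) bool {
            return conns[i].used.After(limit)
        })
        for _, c := range conns[:i] {
            c.c.Close() // release the fd and ephemeral port immediately
        }
        return conns[i:]
    }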
* Remove Compress by default
Set Compress = true in Scrub only when the message does not fit the
advertised buffer. Compression is expensive, so try to avoid it.
Master vs. this branch:
pkg: github.com/coredns/coredns/plugin/cache
master: BenchmarkCacheResponse-2    50000    24774 ns/op
branch: BenchmarkCacheResponse-2   100000    21960 ns/op
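A hedged sketch of the scrub idea; dns.Msg is from miekg/dns, the
function shape is illustrative:

    import "github.com/miekg/dns"

    // scrub makes reply fit the client's advertised buffer size. The
    // Compress flag is only switched on when the uncompressed message is
    // too big, so the common (fitting) case skips the expensive work.
    func scrub(reply *dns.Msg, size int) *dns.Msg {
        if reply.Len() <= size {
            return reply // fits uncompressed: no compression needed
        }
        reply.Compress = true // pay the compression cost only now
        if reply.Len() <= size {
            return reply
        }
        reply.Truncate(size) // still too big: drop records to fit
        return reply
    }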
* and make it compile
Rework the TestProxyClose - close the proxy in the *same* goroutine
as where we started it. Close channels as long as we don't get data
races (this may need another fix).
Move the Dial goroutine out of the connManager - this simplifies things
*and* makes another goroutine go away and removes the need for the
connErr channels - they can now just be dns.Conn.
Also:
Revert "plugin/forward: gracefull stop (#1701)"
This reverts commit 135377bf77.
Revert "rework TestProxyClose (#1735)"
This reverts commit 9e8893a0b5.
* plugin/forward: graceful stop
- stop the connection manager only when no queries are in progress
* minor improvement
* prevent healthcheck on stopped proxy
* revert closing channels
* use standard context
* global: move to context
Move from golang.org/x/net/context to std lib's context.
Change done with:
for i in $(grep -l '/context' **/*.go); do sed -e 's|golang.org/x/net/context|context|' -i $i; echo $i; done
for i in **/*.go; do goimports -w $i; done
* drop from dns.pb.go as well
With this change the original truncated message returned by the
requested server is passed on to the client, instead of returning an
empty dummy message with only the truncation bit set.
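In code the change is roughly this (a sketch, not the exact diff):

    import "github.com/miekg/dns"

    // writeReply passes the upstream's message through even when its TC
    // bit is set, instead of synthesizing an empty reply with only
    // Truncated = true; the client sees the partial answer and can retry
    // over TCP.
    func writeReply(w dns.ResponseWriter, reply *dns.Msg) error {
        return w.WriteMsg(reply) // truncated or not, send it as-is
    }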
- each proxy stores the average RTT (round-trip time) of the last
rttCount queries. For now, rttCount is set to 4
- the read timeout is calculated as double the average RTT, but it
cannot exceed the default timeout (see the sketch after this list)
- the initial average RTT is set to half the default timeout, so the
initial read timeout equals the default timeout
- the RTT of a failed read is counted as the default timeout, so any
failed read increases the average RTT (up to the default timeout)
- dynamic timeouts let us react faster to lost UDP packets
- in the future, we may develop a low-latency forward policy based on
the collected RTT values of the proxies
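A minimal sketch of the arithmetic, using a simple moving-average
update; the identifiers and the 5s default are illustrative, and real
code would also need to make the update race-free:

    import "time"

    const (
        rttCount       = 4               // averaging window
        defaultTimeout = 5 * time.Second // illustrative upper bound
    )

    type proxy struct {
        avgRtt time.Duration // starts at defaultTimeout / 2
    }

    // updateRtt folds one observed round trip into the average; a failed
    // read reports rtt = defaultTimeout, dragging the average upwards.
    func (p *proxy) updateRtt(rtt time.Duration) {
        p.avgRtt += (rtt - p.avgRtt) / rttCount
    }

    // readTimeout is double the average RTT, capped at defaultTimeout.
    // With avgRtt starting at defaultTimeout/2, the first timeout comes
    // out as exactly defaultTimeout.
    func (p *proxy) readTimeout() time.Duration {
        if t := 2 * p.avgRtt; t < defaultTimeout {
            return t
        }
        return defaultTimeout
    }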
* plugin/forward: TCP conns can be closed
Only when we read and get an io.EOF do we know the conn is closed (for
TCP). If this is the case, Dial again and retry. Note that this new
connection can also be closed by the upstream; we may want to add a
DialForceNew or something to force a new TCP connection.
Similar to #1624, *but* this is by (TCP) design. We also don't have to
wait for a timeout, which makes it easier to reason about.
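A sketch of the retry-on-EOF shape; the dial parameter stands in for the
plugin's pooled dialing and is illustrative:

    import (
        "io"

        "github.com/miekg/dns"
    )

    // exchange writes m and reads the reply. A pooled TCP conn that the
    // upstream already closed only reveals itself as io.EOF on the read,
    // so in that case dial a fresh connection and retry once.
    func exchange(dial func() (*dns.Conn, error), m *dns.Msg) (*dns.Msg, error) {
        c, err := dial()
        if err != nil {
            return nil, err
        }
        if err = c.WriteMsg(m); err != nil {
            return nil, err
        }
        reply, err := c.ReadMsg()
        if err == io.EOF {
            if c, err = dial(); err != nil {
                return nil, err
            }
            if err = c.WriteMsg(m); err != nil {
                return nil, err
            }
            reply, err = c.ReadMsg()
        }
        return reply, err
    }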
* Move to forward.go
* doesn't need changing
* plugin/{cache,forward,proxy}: don't allow responses that are bogus
Responses that do not match what we queried for should be dropped.
Forward and proxy convert them into FormErrs; as a second backstop, the
cache will also refuse to cache them.
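A minimal sketch of such a sanity check (illustrative; the actual check
lives in the plugins' request handling):

    import "github.com/miekg/dns"

    // matches reports whether reply plausibly answers req: same message
    // ID and the same single question. Anything else is treated as bogus
    // and rewritten to a FORMERR before the cache ever sees it.
    func matches(req, reply *dns.Msg) bool {
        if reply.Id != req.Id {
            return false
        }
        if len(req.Question) != 1 || len(reply.Question) != 1 {
            return false
        }
        q, r := req.Question[0], reply.Question[0]
        return q.Name == r.Name && q.Qtype == r.Qtype && q.Qclass == r.Qclass
    }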
* plug
* add explicit test
* plugins: Return error for multiple use of some
Return plugin.ErrOnce when a plugin that doesn't support it is called
multiple times.
This now adds it for: cache, dnssec, errors, forward, hosts, nsid.
And changes it slightly in kubernetes, pprof, reload, root.
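In a setup function the guard boils down to something like this (a
simplified sketch; the caddy import path is as of this era and may
differ):

    import (
        "github.com/coredns/coredns/plugin"
        "github.com/mholt/caddy"
    )

    func setup(c *caddy.Controller) error {
        i := 0
        for c.Next() {
            i++
            if i > 1 {
                // The directive appears more than once in this server
                // block, which this plugin does not support.
                return plugin.Error("forward", plugin.ErrOnce)
            }
            // ... parse the rest of the directive ...
        }
        return nil
    }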
* more tests
* doc: some function/vars/const/package level updates
Various updates that stood out while reading godoc.org for CoreDNS.
* Fix some misspellings as well
* plugin/forward: on-demand healthchecking
Only start doing health checks when we encounter an error (any error).
This uses the new plugin/pkg/up package to abstract away the actual
checking. This reduces the LOC quite a bit; it still needs more unit
testing and a bit of tcpdumping.
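A self-contained sketch of the on-demand pattern (the idea behind
plugin/pkg/up, not its actual API): no steady-state health traffic, and
at most one retry loop in flight per proxy.

    import (
        "sync/atomic"
        "time"
    )

    // probe retries check every interval until it succeeds. Triggering
    // it again while a loop is already running is a no-op, so an
    // unhealthy upstream is probed by exactly one goroutine.
    type probe struct {
        running int32
    }

    func (p *probe) do(interval time.Duration, check func() error) {
        if !atomic.CompareAndSwapInt32(&p.running, 0, 1) {
            return // a check loop is already in flight
        }
        go func() {
            defer atomic.StoreInt32(&p.running, 0)
            for check() != nil {
                time.Sleep(interval)
            }
        }()
    }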
* fix tests
* Fix readme
* Use pkg/up for healthchecks
* remove unused channel
* more cleanups
* update readme
* Again do go generate and go build; still referencing the wrong forward
repo? Anyway, fixed.
* Use pkg/up for doing the healthchecks to cut back on unwanted queries
* Change up.Func to return an error instead of a boolean.
* Drop the string target argument as it doesn't make sense.
* Add healthcheck test on failing to get an upstream answer.
TODO(miek): double check Forward and Lookup and how they interact with
HC, and if we correctly call close() on those
* actual test
* Tests here
* more tests
* try getting rid of host
* Get rid of the host indirection
* Finish removing hosts
* moar testing
* import fmt
* field is not used
* docs
* move some stuff
* bring back health_check
* maxfails=0 test
* git and merging, bah
* review
* plugin/forward: add it
This moves coredns/forward into CoreDNS. Fixes a few bugs, adds a
policy option, and adds more tests to the plugin.
Update the documentation, test IPv6 address and add persistent tests.
* Always use random policy when spraying
* include scrub fix here as well
* use correct var name
* Code review
* go vet
* Move logging to metrics
* Small readme updates
* Fix readme