* perf(proxy): use mutex-based connection pool
The proxy package (used for example by the forward plugin) utilized
an actor model where a single connManager goroutine managed
connection pooling via unbuffered channels (dial, yield, ret). This
design serialized all connection acquisition and release operations
through a single goroutine, creating a bottleneck under high
concurrency. This was observable as a performance degradation when
using a single upstream backend compared to multiple backends
(which sharded the bottleneck).
Changes:
- Removed dial, yield, and ret channels from the Transport struct.
- Removed the connManager goroutine's request processing loop.
- Implemented Dial() and Yield() using a sync.Mutex to protect the
connection slice, allowing for fast concurrent access without
context switching.
- Downgraded connManager to a simple background cleanup loop that
only handles connection expiration on a ticker.
- Updated plugin/pkg/proxy/connect.go to use direct method calls
instead of channel sends.
- Updated tests to reflect the removal of internal channels.
Benchmarks show that this change eliminates the single-backend
bottleneck. Now a single upstream backend performs on par with
multiple backends, and overall throughput is improved.
The implementation aligns with standard Go patterns for connection
pooling (e.g., net/http.Transport).
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix: address PR review for persistent.go
- Named mutex field instead of embedding, to not expose
Lock() and Unlock()
- Move stop check outside of lock in Yield()
- Close() without a separate goroutine
- Change stop channel to struct
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix: address code review feedback for conn pool
- Switch from LIFO to FIFO connection selection for source port
diversity, reducing DNS cache poisoning risk (RFC 5452).
- Remove "clear entire cache" optimization as it was LIFO-specific.
FIFO naturally iterates and skips expired connections.
- Remove all goroutines for closing connections; collect connections
while holding lock, close synchronously after releasing lock.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix: remove unused error consts
No longer utilised after refactoring the channel based approach.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* feat(forward): add max_idle_conns option
Add configurable connection pool limit for the forward plugin via
the max_idle_conns Corefile option.
Changes:
- Add SetMaxIdleConns to proxy
- Add maxIdleConns field to Forward struct
- Add max_idle_conns parsing in forward plugin setup
- Apply setting to each proxy during configuration
- Update forward plugin README with new option
By default the value is 0 (unbounded). When set, excess
connections returned to the pool are closed immediately
rather than cached.
Also add a yield related test.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* chore(proxy): simple Dial by closing conns inline
Remove toClose slice collection to reduce complexity. Instead close
expired connections directly while iterating. Reduces complexity with
negligible lock-time impact.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* chore: fewer explicit Unlock calls
Cleaner and less chance of forgetting to unlock on new possible
code paths.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Add configurable resource limits to prevent potential DoS vectors
via connection/stream exhaustion on gRPC, HTTPS, and HTTPS/3 servers.
New configuration plugins:
- grpc_server: configure max_streams, max_connections
- https: configure max_connections
- https3: configure max_streams
Changes:
- Use netutil.LimitListener for connection limiting
- Use gRPC MaxConcurrentStreams and message size limits
- Add QUIC MaxIncomingStreams for HTTPS/3 stream limiting
- Set secure defaults: 256 max streams, 200 max connections
- Setting any limit to 0 means unbounded/fallback to previous impl
Defaults are applied automatically when plugins are omitted from
config.
Includes tests and integration tests.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Replace "reload 2s" with "quic" in quicReloadCorefile to avoid
spawning a background goroutine that reads dnsserver.Port while
TestReadme modifies it. The test TestQUICReloadDoesNotPanic still
verifies the QUIC reload panic fix via explicit inst.Restart() call.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix: prevent QUIC reload panic by lazily initializing the listener
ServePacket on reload receives the reused PacketConn before the new
ServerQUIC has recreated its quic.Listener, so quicListener is nil and
the process panics. Lazily initialise quicListener from the provided
PacketConn when it’s nil and then proceed with ServeQUIC.
fixes: #7679
Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
* test: add regression test for QUIC reload panic
Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
---------
Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Wrap doPrefetch with a fresh metadata context to prevent concurrent
writes to the request-scoped metadata map during background prefetch.
Add a new integration test configuring a plugin chain, triggering
the issue seen here. Hammers concurrent queries while log reads
metadata fields repeatedly.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Update coredns/caddy to a version where Dispenser.NextBlock()
checks Next() and stops at EOF. This ensures forward progress
and prevents an infinite loop when a block is missing a closing '}'
under certain conditions.
Added a regression test.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Fix a panic in presubmit test when import statements are split into
>3 logical blocks (e.g., std, coredns, then third party in multiple
blocks). The computed block index could exceed the fixed array
bounds.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Avoid transient EADDRINUSE and a previous negative WaitGroup panic:
- restart onto a different free port
- no Stop() around Restart()
- channel-coordinated Stop of the new instance
Fixes#7311
Signed-off-by: Syed Azeez <syedazeez337@gmail.com>
Enable protogetter in golangci config and update all protobuf field
access to use getter methods instead of direct field access.
Getter methods provide safer nil pointer handling and return
appropriate default values, following protobuf best practices.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Enable intrange linter to enforce modern Go range syntax over
traditional for loops, by converting:
for i := 0; i < n; i++
to:
for i := range n
Adding type conversions where needed for compatibility
with existing uint64 parameters.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Add conditional test skipping for bind and readme tests that rely on
Linux-specific loopback interface behavior. These tests reference
network configurations that may not exist on for e.g. macOS or other
platforms, causing spurious test failures.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
The rewrite plugin modifies DNS messages, affecting the request
size observed in the coredns_dns_request_size_bytes metric.
This change captures the original request size before any plugins
can modify it. It adds a functional options pattern to Report() to
pass this information while maintaining API compatibility.
Tests have been added to verify the fix prevents rewrite from
affecting the request size metrics.
Docs included.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
www.example.net is now behind akamai with various IP answered and a
chain of CNAME. Let's replace www.example.net by one of the root server
which answer a single IP and hopefully should remain this way.
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* introduce new interface "dnsserver.Viewer", that allows a plugin implementing it to decide if a query should be routed into its server block.
* add new plugin "view", that uses the new interface to enable a user to define expression based conditions that must be met for a query to be routed to its server block.
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* write failures with ResponseReverter instead of letting server write them
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* fix comment
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* Metrics: expand coredns_dns_responses_total with plugin label
This adds (somewhat hacky?) code to add a plugin label to the
coredns_dns_responses_total metric. It's completely obvlious to the
plugin as we just check who called the *recorder.WriteMsg method. We use
runtime.Caller( 1 2 3) to get multiple levels of callers, this should be
deep enough, but it depends on the dns.ResponseWriter wrapping that's
occuring.
README.md of metrics updates and test added in test/metrics_test.go to
check for the label being set.
I went through the plugin to see what metrics could be removed, but
actually didn't find any, the plugin push out metrics that make sense.
Due to the path fiddling to figure out the plugin name I doubt this
works (out-of-the-box) for external plugins, but I haven't tested that.
Signed-off-by: Miek Gieben <miek@miek.nl>
* better comment
Signed-off-by: Miek Gieben <miek@miek.nl>
* Metrics: expand coredns_dns_responses_total with plugin label
This adds (somewhat hacky?) code to add a plugin label to the
coredns_dns_responses_total metric. It's completely obvlious to the
plugin as we just check who called the *recorder.WriteMsg method. We use
runtime.Caller( 1 2 3) to get multiple levels of callers, this should be
deep enough, but it depends on the dns.ResponseWriter wrapping that's
occuring.
README.md of metrics updates and test added in test/metrics_test.go to
check for the label being set.
I went through the plugin to see what metrics could be removed, but
actually didn't find any, the plugin push out metrics that make sense.
Due to the path fiddling to figure out the plugin name I doubt this
works (out-of-the-box) for external plugins, but I haven't tested that.
Signed-off-by: Miek Gieben <miek@miek.nl>
* Update core/dnsserver/server.go
Co-authored-by: dilyevsky <ilyevsky@gmail.com>
* Use [3]string
Signed-off-by: Miek Gieben <miek@miek.nl>
* imports
Signed-off-by: Miek Gieben <miek@miek.nl>
* remove dnstest changes
Signed-off-by: Miek Gieben <miek@miek.nl>
* revert
Signed-off-by: Miek Gieben <miek@miek.nl>
* Add some sleeps to make it less flaky
Signed-off-by: Miek Gieben <miek@miek.nl>
* Revert "Add some sleeps to make it less flaky"
This reverts commit b5c6655196.
* Remove forward when not needed
Signed-off-by: Miek Gieben <miek@miek.nl>
* remove newline
Signed-off-by: Miek Gieben <miek@miek.nl>
Co-authored-by: dilyevsky <ilyevsky@gmail.com>
* write cname answer to client even if target lookup is servfail
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* fix existing unit test expectations
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Deflake by retrying and adding random port numbers. We try 3 times to
get an instance.
Also fix a bug where server.Stop() was called even if the server
creation failed - this was never hit due to t.Fatal() above it, but fix
that nontheless.
Signed-off-by: Miek Gieben <miek@miek.nl>
* share plugins among zones in the same server block
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* update caddy dep
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* simp code
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* copy ListenHosts and Debug from first config
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* copy tls configs from first config
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* add test to validate debug setting is replicated to all configs in block
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
* stop server
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
This was found by fuzzing.
We need to make this a fully qualified domain name to catch all errors
in dnsserver/register.go and not later when plugin.Normalize() is called again on these
strings, with the prime difference being that the domain name is fully
qualified. This was found by fuzzing where "ȶ" is deemed OK, but "ȶ." is
not (might be a bug in miekg/dns actually). But here we were checking ȶ,
which is OK, and later we barf in ȶ. leading to "index out of range".
Added a tests and check manually if it would crash with the current code
(yes), and fail with an error in this PR (yes).
Signed-off-by: Miek Gieben <miek@miek.nl>