Troubleshooting — Pragmatic.ControlPlane
FAQ, connection issues, stale hosts, health reporting, and recovery procedures.
Frequently Asked Questions
Q: Do I need the control plane for a single-instance deployment?
No. All interfaces (IHostIdentity, IHostStatus, IControlPlane) have NoOp defaults registered by Pragmatic.Composition.Host. A monolith works without any UseControlPlane() call.
Q: Can the hub run in the same process as the application?
Yes. This is the “embedded hub” mode. Add both Pragmatic.ControlPlane.SignalR and Pragmatic.ControlPlane.Client to the same host. The ControlPlaneConnectionService defers connection until after Kestrel starts listening, avoiding self-connection deadlocks.
Q: What happens if the hub goes down?
Clients degrade to NoOp behavior. Each host continues serving requests independently. When the hub comes back, clients auto-reconnect (exponential backoff) and re-register. No registry data is lost: the hub is stateless, and its in-memory registry is rebuilt from re-registrations. The in-memory command audit log, however, does not survive a hub restart.
Q: Can I query connected hosts without SignalR?
Yes. The hub exposes REST fallback endpoints:
```bash
# All hosts
curl https://control-plane:5100/_pragmatic/control-plane/hosts

# Specific host
curl https://control-plane:5100/_pragmatic/control-plane/hosts/{hostId}

# Audit trail
curl https://control-plane:5100/_pragmatic/control-plane/audit
```
Q: How do I push maintenance mode to all hosts?
Send a MaintenanceCommand via the hub:
```csharp
var error = await controlPlane.SendCommandAsync(
    targetHostId: "*", // broadcast
    new MaintenanceCommand(Enable: true),
    ct);
```
Or use the hub method directly (server-side):
```csharp
await hub.Clients.All.OnCommand(
    "admin",
    "MaintenanceCommand",
    JsonSerializer.Serialize(new MaintenanceCommand(Enable: true)));
```
Q: How do I scale the hub for high availability?
Add a SignalR backplane:
```csharp
builder.Services.AddSignalR()
    .AddStackExchangeRedis("redis:6379");
```
This syncs the registry across hub instances. Without a backplane, each hub instance has its own isolated registry.
Connection Issues
Client cannot connect to hub
Symptom: Failed to connect to control plane at {HubUrl}, operating in degraded mode
Checklist:
| # | Check | How |
|---|---|---|
| 1 | Hub is running | curl https://hub-url/_pragmatic/control-plane/hosts |
| 2 | URL is correct | Verify HubUrl in appsettings.json — must include full path /_pragmatic/control-plane |
| 3 | API key matches | Compare hub and client ApiKey values |
| 4 | Network reachable | ping hub-host or telnet hub-host 5100 |
| 5 | TLS certificate valid | Check for certificate errors in logs |
| 6 | Firewall allows port | Verify port 5100 (or configured port) is open |
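When telnet is not available on the host, check 4 can be approximated from .NET itself. The following is a minimal sketch that only tests whether a TCP connection opens within a timeout; `HubReachability` is a hypothetical helper and not part of Pragmatic.ControlPlane, and a successful probe says nothing about TLS or API key validity.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of Pragmatic.ControlPlane.
public static class HubReachability
{
    // Returns true if a TCP connection to host:port opens within the timeout.
    public static async Task<bool> CanConnectAsync(string host, int port, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        using var client = new TcpClient();
        try
        {
            await client.ConnectAsync(host, port, cts.Token);
            return true;
        }
        catch (SocketException)
        {
            return false; // refused, unreachable, or name resolution failed
        }
        catch (OperationCanceledException)
        {
            return false; // no answer within the timeout
        }
    }
}
```

For example, `await HubReachability.CanConnectAsync("hub-host", 5100, TimeSpan.FromSeconds(3))` mirrors the telnet check from the table.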
Frequent reconnections
Symptom: Logs show repeated Control plane connection lost, reconnecting... followed by Reconnected to control plane, re-registering...
Causes:
- Network instability — Intermittent connectivity between client and hub
- Hub under load — Too many connected hosts overwhelming the SignalR hub
- Keep-alive timeout — SignalR keep-alive not matching infrastructure timeouts (load balancers, proxies)
Fix: Increase the heartbeat interval to reduce hub load:
```csharp
cp.WithHeartbeatInterval(TimeSpan.FromSeconds(60));
```
For load balancers, ensure WebSocket connections are not terminated prematurely. Configure the idle timeout to at least 2x the heartbeat interval.
Connection closed permanently
Symptom: Control plane connection closed permanently, operating in degraded mode
Cause: The client exhausted all reconnection attempts (MaxReconnectAttempts).
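To judge whether the attempt budget is large enough, it helps to estimate how long a client keeps retrying before it gives up. The sketch below sums the backoff delays assuming pure doubling from 1 second with a 30-second cap. That schedule is an assumption, and the library's actual delays may reach the cap an attempt earlier or later; `ReconnectBudget` is a hypothetical helper.

```csharp
using System;

// Hypothetical helper for illustration; the real backoff schedule may differ.
public static class ReconnectBudget
{
    // Delay before the given 1-based attempt: 1s, 2s, 4s, ... capped at 30s (assumed).
    public static TimeSpan DelayFor(int attempt)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        double seconds = Math.Min(Math.Pow(2, attempt - 1), 30);
        return TimeSpan.FromSeconds(seconds);
    }

    // Total time spent waiting across all attempts, ignoring connection time itself.
    public static TimeSpan TotalWait(int maxAttempts)
    {
        var total = TimeSpan.Zero;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
            total += DelayFor(attempt);
        return total;
    }
}
```

Under these assumptions, 10 attempts cover roughly three minutes of hub downtime and 20 attempts roughly eight, which gives a feel for how much outage a given setting tolerates.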
Fix: Increase the maximum number of reconnection attempts:
```csharp
cp.WithMaxReconnectAttempts(20); // Default is 10
```
The exponential backoff caps at 30 seconds per attempt:
```text
Attempt 1: 1s, Attempt 2: 2s, Attempt 3: 4s, ... Attempt 5+: 30s
```
Stale Hosts
Hosts showing as connected but actually down
Symptom: GET /_pragmatic/control-plane/hosts shows a host with an old LastHeartbeat timestamp, but the host process is dead.
Cause: The stale eviction interval has not elapsed yet.
Fix: Reduce the stale eviction interval on the hub:
```csharp
hub.WithStaleEvictionInterval(TimeSpan.FromMinutes(1));
```
After the interval elapses, StaleHostEvictionService removes the host and broadcasts OnHostDisconnected.
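While waiting for eviction, it can be useful to list which hosts already look stale. The sketch below filters a host list by heartbeat age. `HostSnapshot` and `StaleHostFinder` are hypothetical names; the actual shape of the entries returned by the hosts endpoint may differ, and only the LastHeartbeat and HostId fields mentioned in this guide are assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for one entry of the host list (assumption).
public record HostSnapshot(string HostId, DateTimeOffset LastHeartbeat);

public static class StaleHostFinder
{
    // Returns hosts whose last heartbeat is older than maxAge, oldest first.
    public static IReadOnlyList<HostSnapshot> FindStale(
        IEnumerable<HostSnapshot> hosts, DateTimeOffset now, TimeSpan maxAge) =>
        hosts.Where(h => now - h.LastHeartbeat > maxAge)
             .OrderBy(h => h.LastHeartbeat)
             .ToList();
}
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to run against a saved copy of the host list.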
Hosts repeatedly appearing and disappearing
Symptom: A host shows up, then is evicted, then re-registers, in a loop.
Causes:
- Heartbeat not reaching hub — Network issues between specific client and hub
- Eviction interval too short — See Common Mistake #4
- Client GC pauses — Long GC pauses cause heartbeat to be delayed past the eviction threshold
Diagnosis: Compare HeartbeatInterval (client) with StaleEvictionInterval (hub). The eviction interval should be at least 4x the heartbeat interval.
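The two ratio rules in this guide (eviction interval at least 4x the heartbeat, load balancer idle timeout at least 2x the heartbeat) are easy to check mechanically. The following is a small sketch of such a check; `TimingRules` is a hypothetical helper, and the values have to be supplied by hand because no configuration-reading API is assumed here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Pragmatic.ControlPlane.
public static class TimingRules
{
    // Returns one message per violated rule; an empty list means the timings are consistent.
    public static IReadOnlyList<string> Check(
        TimeSpan heartbeat, TimeSpan staleEviction, TimeSpan? proxyIdleTimeout = null)
    {
        var problems = new List<string>();

        // Rule 1: eviction interval >= 4x heartbeat interval.
        if (staleEviction.Ticks < heartbeat.Ticks * 4)
            problems.Add($"StaleEvictionInterval {staleEviction} is below 4x HeartbeatInterval {heartbeat}.");

        // Rule 2 (optional): proxy or load balancer idle timeout >= 2x heartbeat interval.
        if (proxyIdleTimeout is { } idle && idle.Ticks < heartbeat.Ticks * 2)
            problems.Add($"Idle timeout {idle} is below 2x HeartbeatInterval {heartbeat}.");

        return problems;
    }
}
```

For example, a 60-second heartbeat combined with a 1-minute eviction interval fails the first rule, so raising the heartbeat interval usually means raising the eviction interval as well.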
Health Reporting Issues
Health always shows Healthy
Cause: No IHostHealthContributor implementations are registered. Without contributors, the aggregator defaults to Healthy.
Fix: Register health contributors for critical dependencies:
```csharp
services.AddSingleton<IHostHealthContributor, DatabaseHealthContributor>();
```
Health contributor throws exceptions
Symptom: Health check fails with an unhandled exception from a contributor.
Cause: The contributor’s CheckAsync method throws instead of returning Unhealthy.
Fix: Always catch exceptions in health contributors:
```csharp
public async Task<HealthContribution> CheckAsync(CancellationToken ct)
{
    try
    {
        // Check dependency...
        return new HealthContribution(Name, ContributorHealthStatus.Healthy);
    }
    catch (Exception ex)
    {
        return new HealthContribution(Name, ContributorHealthStatus.Unhealthy, ex.Message);
    }
}
```
The HostHealthAggregator does catch contributor exceptions, but a clean return is preferred.
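The two behaviors described in this section (no contributors means Healthy, a throwing contributor is treated as a failure) can be illustrated with a simplified model. This is a sketch under assumptions, not the actual HostHealthAggregator: the `Sketch*` types are hypothetical stand-ins, and the "worst status wins" rule is assumed rather than taken from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins; the real library types are richer.
public enum SketchStatus { Healthy = 0, Unhealthy = 1 }

public sealed record SketchContributor(
    string Name, Func<CancellationToken, Task<SketchStatus>> Check);

public static class SketchAggregator
{
    public static async Task<SketchStatus> AggregateAsync(
        IEnumerable<SketchContributor> contributors, CancellationToken ct = default)
    {
        var worst = SketchStatus.Healthy; // with no contributors this is the result
        foreach (var contributor in contributors)
        {
            SketchStatus status;
            try
            {
                status = await contributor.Check(ct);
            }
            catch (Exception)
            {
                status = SketchStatus.Unhealthy; // a throwing contributor counts as unhealthy
            }
            if (status > worst) worst = status; // assumed rule: worst status wins
        }
        return worst;
    }
}
```

The model also shows why an empty contributor list is misleading: the result is Healthy even though nothing was checked.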
Command Dispatch Issues
Command not received by target host
Symptom: SendCommandAsync returns null (success) but the target host does not execute the command.
Checklist:
- Target host connected? Check GetAllHostsAsync() for the targetHostId
- Handler registered? Verify IHostCommandHandler<T> is in DI on the target host
- Command type matches? The type name must match between sender and receiver
- Dispatcher present? IHostCommandDispatcher must be registered (automatic with Composition.Host)
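The type-name check in the list above is the easiest one to overlook. Since commands travel as a type name plus a JSON payload, a receiver that looks up handlers by name simply finds nothing when the names differ. The sketch below models that lookup; `SketchDispatcher` is a hypothetical simplification rather than the real dispatcher, and the local `MaintenanceCommand` record is a stand-in for the library type.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Local stand-in for the library's command type (assumption).
public record MaintenanceCommand(bool Enable);

// Hypothetical, simplified dispatcher keyed by command type name.
public sealed class SketchDispatcher
{
    private readonly Dictionary<string, Action<string>> _handlers = new();

    // Registers a handler under the short type name of T.
    public void Register<T>(Action<T> handler) =>
        _handlers[typeof(T).Name] = json => handler(JsonSerializer.Deserialize<T>(json)!);

    // Returns false when no handler is registered under the given type name.
    public bool TryDispatch(string commandType, string payloadJson)
    {
        if (!_handlers.TryGetValue(commandType, out var invoke)) return false;
        invoke(payloadJson);
        return true;
    }
}
```

In this model, dispatching "MaintenanceCommand" reaches the handler, while a renamed or namespaced variant of the name returns false without any error, which matches the "sent successfully but never executed" symptom.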
“No IHostCommandDispatcher registered” in logs
Cause: The HostCommandDispatcher is registered by Pragmatic.Composition.Host. If the host does not reference Composition.Host, commands are silently ignored.
Fix: Ensure the host project references Pragmatic.Composition.Host (it should, for PragmaticApp.RunAsync).
Command audit shows “Host not connected”
Cause: The target host ID does not match any registered host.
Fix: Use the correct HostId. Query GET /_pragmatic/control-plane/hosts to find the current host IDs. Remember: HostId is regenerated on every restart (Guid7).
Migration Coordination Issues
Two hosts both claiming migration leadership
Symptom: Both hosts log “Migration leadership claimed for {Database} by {HostId}”.
Cause: If the hub is not used for migration coordination, DatabaseLeaderElection operates independently. Two hosts connecting to the same database should be coordinated by the __PragmaticLock table.
Diagnosis: Check if both hosts are using the same connection string. Different connection strings (even to the same database) may not share the lock table.
Migration leadership not released after completion
Symptom: New hosts cannot claim migration leadership. The hub shows a leader but that host is no longer migrating.
Cause: The host completed migrations but did not call ReleaseMigrationLeadership. This can happen if the host crashes between migration completion and release.
Fix for hub-based leadership: The ControlPlaneHub.OnDisconnectedAsync handler calls CleanupMigrationLeadership automatically when a host disconnects. If the host is still connected but stuck, restart it.
Fix for DatabaseLeaderElection: The lock expires after the configured timeout (default: 5 minutes). Wait for expiry, or manually release:
```sql
UPDATE "__PragmaticLock"
SET "HolderId" = NULL, "AcquiredAt" = NULL, "ExpiresAt" = NULL
WHERE "LockName" = 'migration';
```
Recovery Procedures
Hub Restart Recovery
When the hub restarts:
- All clients detect disconnection and start reconnecting
- On reconnect, each client re-registers its HostInfo
- The registry is rebuilt from scratch (stateless hub)
- Migration leadership state is lost, but DatabaseLeaderElection provides the safety net
- Command audit log is lost (in-memory)
For persistent audit, consider forwarding audit entries to an external store.
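One simple form of external store is an append-only JSON Lines file that a scheduled job fills from the audit endpoint. The sketch below shows only the append step. `AuditEntry` is a hypothetical record whose fields are assumptions, since the real audit entry shape is not documented here, and `AuditArchive` is likewise illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical entry shape; the real audit payload may have different fields.
public record AuditEntry(
    DateTimeOffset Timestamp, string TargetHostId, string CommandType, string? Error);

public static class AuditArchive
{
    // Appends each entry as one JSON line and returns the number of lines written.
    public static int Append(string path, IEnumerable<AuditEntry> entries)
    {
        var lines = entries.Select(e => JsonSerializer.Serialize(e)).ToList();
        File.AppendAllLines(path, lines);
        return lines.Count;
    }
}
```

A file is only the simplest option. The same append step could target a database table or a log pipeline, and a real forwarder would also need to skip entries it has already stored.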
Client Restart Recovery
When a client host restarts:
- The hub detects disconnection via OnDisconnectedAsync
- Migration leadership held by that host is released
- Other hosts are notified via OnHostDisconnected
- When the host starts again, it gets a new HostId and re-registers
Full Cluster Restart
When all hosts restart simultaneously:
- The hub starts first (if dedicated) or with the primary host (if embedded)
- Clients connect after ApplicationStarted
- Migration leadership is contested, and DatabaseLeaderElection ensures only one wins
- After migrations complete, all hosts transition to Serving
Diagnostic Checklist
| # | Check | How |
|---|---|---|
| 1 | Hub is running | curl https://hub/_pragmatic/control-plane/hosts |
| 2 | Client is connected | Check IControlPlane.IsConnected or logs |
| 3 | API keys match | Compare hub and client configuration |
| 4 | Heartbeats flowing | Check LastHeartbeat in host list |
| 5 | Stale eviction configured | Hub StaleEvictionInterval >= 4x client HeartbeatInterval |
| 6 | Health contributors registered | Check DI for IHostHealthContributor |
| 7 | Commands dispatched | Check /_pragmatic/control-plane/audit |
| 8 | Migration leadership clean | Check hub registry or __PragmaticLock table |
Log Level Configuration
For detailed control plane diagnostics:
```json
{
  "Logging": {
    "LogLevel": {
      "Pragmatic.ControlPlane": "Debug",
      "Pragmatic.ControlPlane.Client": "Debug",
      "Pragmatic.ControlPlane.SignalR": "Debug",
      "Microsoft.AspNetCore.SignalR": "Debug"
    }
  }
}
```
At Debug level, the client logs every heartbeat attempt, reconnection, and command dispatch. The hub logs every registration, state change, and stale eviction check.