
Architecture and Core Concepts

This guide explains why Pragmatic.Resilience exists, how its pieces fit together, and how to choose the right strategy for each situation. Read this before diving into the individual policy guides.


Distributed systems fail. HTTP calls time out, databases go down, third-party APIs return 500s, and downstream services get overwhelmed. Without resilience patterns, a single failing dependency can cascade across your entire application.

public class PaymentGatewayClient(HttpClient http)
{
    public async Task<PaymentResult> ChargeAsync(ChargeRequest request, CancellationToken ct)
    {
        try
        {
            var response = await http.PostAsJsonAsync("/charges", request, ct);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadFromJsonAsync<PaymentResult>(ct);
        }
        catch (HttpRequestException)
        {
            // Retry? How many times? With what delay?
            // What if the service is down for minutes?
            throw;
        }
        catch (TaskCanceledException)
        {
            // Timeout? Or did the caller cancel?
            throw;
        }
    }
}

One external call, and the problems are already stacking up:

  • No retry logic. A single transient network blip causes a hard failure.
  • No timeout control. The default HttpClient timeout is 100 seconds. A hung downstream service keeps each request pending for more than a minute and a half.
  • No circuit breaking. If the payment gateway is down, every request still attempts the call, wasting resources and increasing latency for all users.
  • No concurrency control. Under load, hundreds of threads pile into the failing service simultaneously.

The Polly/manual approach: scattered, coupled, allocation-heavy


Developers often turn to Polly or hand-roll resilience logic:

// Polly approach: works, but has tradeoffs
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt)));

var circuitBreaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

var timeout = Policy.TimeoutAsync(10);

var combined = Policy.WrapAsync(timeout, circuitBreaker, retryPolicy);
await combined.ExecuteAsync(ct => http.PostAsJsonAsync("/charges", request, ct), ct);

This works, but:

  • External dependency. Polly pulls in transitive NuGet packages and is not AOT-safe without careful configuration.
  • Scattered configuration. Policies are defined in code, in DI registration, and sometimes in configuration files. There is no single source of truth.
  • No source generator (SG) integration. Every DomainAction that calls an external service must manually wire its pipeline. Forget once, and that action runs without protection.
  • Result-unaware. Polly retries on exceptions but does not understand Result<T, E>. A validation error (Result.Failure) would be retried as if it were a transient failure.

The fundamental issue: resilience is a cross-cutting concern that should be declared, not implemented. Every external call needs the same patterns, yet every developer implements them differently.


Pragmatic.Resilience inverts the model. You declare which resilience policy applies to an operation, and the framework composes the pipeline at runtime (or the source generator wires it at compile time for DomainActions).

The same payment gateway call with Pragmatic.Resilience:

[DomainAction]
[ResiliencePolicy("payment-gateway")]
public partial class ChargeCustomerAction : DomainAction<PaymentResult>
{
    public override async Task<Result<PaymentResult, IError>> Execute(CancellationToken ct)
    {
        // This execution is wrapped by the "payment-gateway" resilience pipeline.
        // Exceptions trigger retry + circuit breaker.
        // Result failures (validation, not found) pass through unchanged.
        var response = await http.PostAsJsonAsync("/charges", request, ct);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadFromJsonAsync<PaymentResult>(ct);
    }
}

The policy is defined once in configuration:

{
  "Resilience": {
    "Policies": {
      "payment-gateway": {
        "Timeout": { "Timeout": "00:00:10" },
        "Retry": { "MaxRetries": 3, "BackoffType": "Exponential", "BaseDelay": "00:00:00.200" },
        "CircuitBreaker": { "FailureThreshold": 5, "BreakDuration": "00:00:30" }
      }
    }
  }
}

Three strategies, one attribute, zero manual wiring. The source generator injects IResiliencePipelineProvider and wraps the action’s execution. The configuration can change without recompiling.

Key design properties:

  • Native, AOT-safe. No Polly dependency. No external NuGet packages. Full AOT compatibility.
  • Result-aware. Only exceptions trigger resilience strategies. Result<T, E> failures are business errors — retrying a “not found” or “validation failed” would be wrong.
  • Zero overhead when unused. Unknown policy names resolve to PassthroughPipeline.Instance, a no-op singleton. No runtime errors, no wrapping, no allocation.
  • Declarative configuration. Policies live in appsettings.json or fluent code. The same policy name is reused across multiple operations.
  • Observable by default. Every strategy emits structured logs ([LoggerMessage]), OpenTelemetry traces (ActivitySource), and metrics (Meter).

Every resilience-protected operation flows through a deterministic pipeline. Strategies are composable and ordered by their Order value (ascending). Lower order means the strategy wraps more of the pipeline — it is more “external.”

Request
|
v
Timeout (Order 100) -------> Cancels if total time exceeded
|
v
Bulkhead (Order 200) ------> Rejects if max concurrency reached
|
v
Circuit Breaker (Order 300) > Rejects if circuit is open
|
v
Retry (Order 400) ----------> Retries on transient exception
|
v
Fallback (Order 500) -------> Catches exception, returns alternative
|
v
Your Operation

The order is fixed by StrategyOrder constants and enforced by the pipeline builder. You do not choose the order — the framework composes them correctly regardless of the order you add them in the builder.

Timeout wraps everything (Order 100). If the total time — including all retries and circuit breaker probes — exceeds the timeout, the entire pipeline is cancelled. This prevents a retry storm from running indefinitely.

Bulkhead runs before circuit breaker (Order 200). Even if the circuit is closed, you want to limit how many concurrent requests hit the downstream service. This prevents resource exhaustion during healthy operation, not just during failures.

Circuit breaker runs before retry (Order 300). If the circuit is open, there is no point retrying. The circuit breaker rejects the request immediately with CircuitBrokenException, saving the retry attempts for when the service recovers.

Retry is close to the operation (Order 400). Each retry attempt goes back through the circuit breaker (which records the failure) but not through the timeout or bulkhead (which already allocated their slot). This means retry failures contribute to the circuit breaker’s failure count.

Fallback is innermost (Order 500). If everything else fails — retries exhausted, circuit open, timeout exceeded — the fallback provides a degraded response instead of an exception.


Retry

Retries the operation on transient exceptions with configurable backoff and jitter.

When to use: Transient failures — network blips, HTTP 503 from a temporarily overloaded service, database connection timeouts.

When NOT to use: Permanent failures (authentication errors, 404s, validation failures). Retrying these wastes resources and delays the error response.

new RetryOptions
{
    MaxRetries = 3,                        // 0 = no retries, max 100
    BaseDelay = TimeSpan.FromMilliseconds(200),
    BackoffType = BackoffType.Exponential, // Constant, Linear, or Exponential
    MaxDelay = TimeSpan.FromSeconds(30),   // Cap to prevent unbounded growth
    UseJitter = true,                      // Jitter (default: true)
    ShouldRetry = ex => ex is HttpRequestException or TimeoutException
}

Backoff formulas:

  • Constant: baseDelay (same delay every time)
  • Linear: baseDelay * (attempt + 1) (200ms, 400ms, 600ms, …)
  • Exponential: baseDelay * 2^attempt (200ms, 400ms, 800ms, …)

Jitter: When UseJitter is enabled, the computed delay is multiplied by a random factor in [0.5, 1.5). This randomization follows the jittered-backoff guidance popularized by AWS and prevents the thundering herd problem, where many clients retry at the same instant after a shared failure. A thread-local Random avoids lock contention in high-throughput scenarios.
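The backoff and jitter arithmetic can be sketched as follows. This is a minimal, self-contained illustration of the exponential formula above; ComputeDelay is a hypothetical helper, not the library's internal method:

```csharp
using System;

static class BackoffMath
{
    // Exponential backoff with optional multiplicative jitter, matching the
    // formulas above. Illustrative only; not the library's implementation.
    public static TimeSpan ComputeDelay(
        int attempt, TimeSpan baseDelay, TimeSpan maxDelay, bool useJitter, Random rng)
    {
        // Exponential: baseDelay * 2^attempt, capped at maxDelay.
        double ms = Math.Min(
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt),
            maxDelay.TotalMilliseconds);

        // Jitter: multiply the computed delay by a random factor in [0.5, 1.5).
        if (useJitter)
            ms *= 0.5 + rng.NextDouble();

        return TimeSpan.FromMilliseconds(ms);
    }
}
```

Without jitter, attempts 0 through 2 with a 200 ms base yield 200 ms, 400 ms, and 800 ms; with jitter, each delay lands somewhere between half and one-and-a-half times that value.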

Exception on exhaustion: Throws RetryExhaustedException when all attempts fail. The inner exception contains the last failure encountered.

Timeout

Cancels the operation if it exceeds the configured duration.

When to use: Any external call where you need a hard upper bound on latency. Network calls, database queries, third-party API calls.

When NOT to use: CPU-bound computation (use Task.Run with cancellation instead). Long-running background jobs that are expected to run for minutes.

new TimeoutOptions
{
    Timeout = TimeSpan.FromSeconds(10),
    TimeoutType = TimeoutType.Optimistic // or Pessimistic
}

Timeout types:

  • Optimistic (default): Creates a linked CancellationToken and cancels it after the timeout. Preferred for all operations that honor cancellation tokens (most async .NET APIs, HttpClient, EF Core, etc.).
  • Pessimistic: Races Task.Delay against the operation via Task.WhenAny. For operations that do not honor cancellation (legacy sync code wrapped in a task, third-party libraries that ignore tokens). The operation continues running in the background after timeout — use with caution.

Exception on timeout: Throws TimeoutRejectedException with the operation name and timeout duration.
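The pessimistic mode can be approximated with a few lines of plain .NET. This is a sketch of the Task.WhenAny racing pattern described above; PessimisticTimeout is a hypothetical helper, not the strategy's actual implementation:

```csharp
using System;
using System.Threading.Tasks;

static class PessimisticTimeout
{
    // Race the operation against a delay. If the delay wins, stop waiting and
    // throw; the operation itself cannot be cancelled and keeps running.
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> operation, TimeSpan timeout)
    {
        var work = operation();
        var delay = Task.Delay(timeout);

        if (await Task.WhenAny(work, delay) == delay)
        {
            // The abandoned task continues in the background; we only stop waiting.
            throw new TimeoutException($"Operation exceeded {timeout}.");
        }

        return await work; // Propagates the operation's result or exception.
    }
}
```

The key caveat is visible in the code: losing the race does not stop the operation, so the abandoned task may still consume threads, sockets, or memory after the timeout fires.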

Circuit Breaker

Opens after consecutive failures, rejects requests while open, and allows a single probe after the break duration elapses.

When to use: External dependencies that can be fully down for extended periods. Payment gateways, email services, third-party APIs.

When NOT to use: Operations where every call is independent and has no relationship to previous failures (e.g., reading different files from disk).

new CircuitBreakerOptions
{
    FailureThreshold = 5, // Consecutive failures before opening
    BreakDuration = TimeSpan.FromSeconds(30),
    ShouldHandle = ex => ex is not ArgumentException // Don't count argument errors
}

State machine:

Closed ---[threshold failures]--> Open ---[break elapsed]--> HalfOpen
   ^                                ^                            |
   |                                +-------[probe fails]--------+
   |                                                             |
   +----------------------[probe succeeds]-----------------------+

  • Closed: Normal operation. Failures are counted. Success resets the failure count.
  • Open: All requests rejected immediately with CircuitBrokenException. No attempt to call the downstream service.
  • HalfOpen: One probe request is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens for another break duration.
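The transitions above can be sketched as a small state machine. This is illustrative only; the library's circuit breaker strategy and ICircuitBreakerStateStore handle this internally, and the member names here are assumptions:

```csharp
using System;

enum CircuitState { Closed, Open, HalfOpen }

sealed class CircuitStateMachine
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private int _consecutiveFailures;
    private DateTimeOffset _breakUntil;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public CircuitStateMachine(int failureThreshold, TimeSpan breakDuration)
        => (_failureThreshold, _breakDuration) = (failureThreshold, breakDuration);

    public bool AllowRequest(DateTimeOffset now)
    {
        if (State == CircuitState.Open && now >= _breakUntil)
            State = CircuitState.HalfOpen;   // Break elapsed: allow one probe
        return State != CircuitState.Open;   // Open rejects immediately
    }

    public void OnSuccess()
    {
        _consecutiveFailures = 0;
        State = CircuitState.Closed;         // Probe succeeded, or normal success
    }

    public void OnFailure(DateTimeOffset now)
    {
        if (State == CircuitState.HalfOpen || ++_consecutiveFailures >= _failureThreshold)
        {
            State = CircuitState.Open;       // Open (or reopen after a failed probe)
            _breakUntil = now + _breakDuration;
            _consecutiveFailures = 0;
        }
    }
}
```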

State store: Circuit state is managed by ICircuitBreakerStateStore. The default InMemoryCircuitBreakerStateStore is thread-safe and per-process. For distributed scenarios (multiple app instances sharing circuit state), implement the interface with Redis or a database backend.

Circuit key: The circuit key defaults to ResilienceContext.OperationKey if set, otherwise OperationName. This means you can share one circuit across multiple operations that call the same downstream service, or isolate them with different keys.

Bulkhead

Limits concurrent executions using SemaphoreSlim to prevent one operation from consuming all available threads or connections.

When to use: Operations that access a shared resource with limited capacity (database connection pool, external API with rate limits, shared file system).

When NOT to use: CPU-bound operations (use Task.Run with a bounded TaskScheduler instead). Operations that are already bounded by other mechanisms (e.g., an HTTP client with MaxConnectionsPerServer).

new BulkheadOptions
{
    MaxConcurrency = 10,                      // Maximum concurrent executions
    MaxQueuedActions = 5,                     // Overflow queue (0 = no queue)
    QueueTimeout = TimeSpan.FromSeconds(2)    // Max wait in queue
}

When all slots are taken and the queue is full (or disabled with MaxQueuedActions = 0), the request is rejected immediately with BulkheadRejectedException.
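A simplified version of this behavior can be sketched with SemaphoreSlim. This is illustrative only: the real strategy also bounds the queue length via MaxQueuedActions, which this sketch omits, and the type and exception here are stand-ins:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

sealed class SimpleBulkhead
{
    private readonly SemaphoreSlim _slots;
    private readonly TimeSpan _queueTimeout;

    public SimpleBulkhead(int maxConcurrency, TimeSpan queueTimeout)
    {
        _slots = new SemaphoreSlim(maxConcurrency, maxConcurrency);
        _queueTimeout = queueTimeout;
    }

    public async Task<T> ExecuteAsync<T>(Func<CancellationToken, Task<T>> op, CancellationToken ct)
    {
        // Wait up to queueTimeout for a free slot; reject if none frees up.
        if (!await _slots.WaitAsync(_queueTimeout, ct))
            throw new InvalidOperationException("Bulkhead rejected: max concurrency reached.");
        try { return await op(ct); }
        finally { _slots.Release(); } // Always free the slot, even on failure
    }
}
```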

Fallback

Catches exceptions and provides an alternative result. This is the “last resort” strategy for graceful degradation.

When to use: Operations where a degraded response is better than an error. Returning cached data when the source is unavailable, returning a default configuration when the config service is down.

When NOT to use: Critical operations where partial data could cause data corruption or inconsistency (financial transactions, inventory updates).

new FallbackOptions<UserDto>
{
    FallbackAction = (ex, ctx, ct) => Task.FromResult(UserDto.Default),
    ShouldHandle = ex => ex is HttpRequestException,
    OnFallback = (ex, ctx) => logger.LogWarning("Using fallback for {Op}", ctx.OperationName)
}

Fallback is a generic strategy (FallbackStrategy<TResult>) that only activates when the result type matches. It is not configurable via appsettings.json because it requires a typed factory delegate — use the fluent builder.


Strategies compose into a single pipeline. In configuration, set any strategy to null (or omit it) to exclude it from the pipeline.

new ResiliencePolicyOptions
{
    Timeout = new() { Timeout = TimeSpan.FromSeconds(10) }, // Included
    Retry = new() { MaxRetries = 3 },                       // Included
    CircuitBreaker = new() { FailureThreshold = 5 },        // Included
    Bulkhead = null                                         // Excluded
}

Only non-null strategies are composed. A policy with only Timeout set produces a pipeline with exactly one strategy. A policy with nothing set produces PassthroughPipeline.Instance (zero overhead).

| Scenario | Strategies | Why |
| --- | --- | --- |
| External HTTP API | Timeout + Retry + CircuitBreaker | Transient failures need retry; sustained outage needs circuit breaking; timeout prevents hanging |
| Database query | Timeout + Retry(2, Constant) | Brief connection timeouts are retried; no circuit breaker because every query is independent |
| Shared resource (connection pool) | Timeout + Bulkhead | Limit concurrency to match pool size; timeout prevents queued requests from waiting forever |
| Non-critical feature (recommendations) | Timeout + Retry + Fallback | Retry transient failures; return cached/default data if all else fails |
| Critical write (payment) | Timeout + CircuitBreaker | Timeout prevents hanging; circuit breaker prevents hammering a down service; no retry because payment could be processed but response lost |

{
  "Resilience": {
    "Default": {
      "Timeout": {
        "Timeout": "00:00:30",
        "TimeoutType": "Optimistic"
      }
    },
    "Policies": {
      "external-api": {
        "Timeout": { "Timeout": "00:00:10" },
        "Retry": {
          "MaxRetries": 3,
          "BaseDelay": "00:00:00.200",
          "BackoffType": "Exponential",
          "MaxDelay": "00:00:30",
          "UseJitter": true
        },
        "CircuitBreaker": {
          "FailureThreshold": 5,
          "BreakDuration": "00:00:30"
        }
      },
      "database": {
        "Timeout": { "Timeout": "00:00:05" },
        "Retry": {
          "MaxRetries": 2,
          "BackoffType": "Constant",
          "BaseDelay": "00:00:00.100"
        }
      }
    }
  }
}

When IResiliencePipelineProvider.GetPipeline(name) is called, the provider checks in this order:

  1. Fluent overrides — policies registered via AddPolicy() on the provider instance
  2. Configuration — policies from the ResilienceOptions.Policies dictionary
  3. Default — ResilienceOptions.Default, if set
  4. Passthrough — PassthroughPipeline.Instance (zero overhead, no wrapping)

This means a fluent override always wins over configuration, and configuration always wins over the default. Unknown policy names silently resolve to passthrough — no runtime errors.
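The lookup order, together with the pipeline caching described later in this guide, can be sketched with stand-in types. Strings substitute for built pipelines here; this is not the library's code, and the constructor shape is an assumption:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

sealed class PipelineProvider
{
    // Built pipelines are cached by name; GetOrAdd builds on first request only.
    private readonly ConcurrentDictionary<string, string> _cache = new();
    private readonly Dictionary<string, string> _fluentOverrides;
    private readonly Dictionary<string, string> _configuredPolicies;
    private readonly string? _defaultPolicy;

    public PipelineProvider(
        Dictionary<string, string> fluentOverrides,
        Dictionary<string, string> configuredPolicies,
        string? defaultPolicy)
        => (_fluentOverrides, _configuredPolicies, _defaultPolicy)
            = (fluentOverrides, configuredPolicies, defaultPolicy);

    public string GetPipeline(string name) => _cache.GetOrAdd(name, Resolve);

    private string Resolve(string name)
    {
        if (_fluentOverrides.TryGetValue(name, out var o)) return o;     // 1. fluent wins
        if (_configuredPolicies.TryGetValue(name, out var c)) return c;  // 2. configuration
        if (_defaultPolicy is not null) return _defaultPolicy;           // 3. default
        return "passthrough";                                            // 4. no-op, no error
    }
}
```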

AddPragmaticResilience() registers three services:

| Service | Lifetime | Description |
| --- | --- | --- |
| IResiliencePipelineProvider | Singleton | Resolves named pipelines by name or operation |
| ICircuitBreakerStateStore | Singleton | Default: InMemoryCircuitBreakerStateStore |
| ResilienceOptions | Singleton | Configuration via IOptions<ResilienceOptions> |

The provider caches built pipelines by name using ConcurrentDictionary. The first call to GetPipeline("name") builds the pipeline; subsequent calls return the cached instance.

For standalone usage without dependency injection:

var stateStore = new InMemoryCircuitBreakerStateStore();

var pipeline = new ResiliencePipelineBuilder()
    .AddRetry(o => { o.MaxRetries = 3; o.BackoffType = BackoffType.Exponential; })
    .AddTimeout(o => o.Timeout = TimeSpan.FromSeconds(5))
    .AddCircuitBreaker(stateStore, o => { o.FailureThreshold = 5; })
    .Build();

var result = await pipeline.ExecuteAsync(
    (ctx, ct) => httpClient.GetStringAsync(url, ct),
    new ResilienceContext { OperationName = "FetchData" });

The builder sorts strategies by Order automatically. You can add them in any order — AddRetry before AddTimeout produces the same pipeline as the reverse.

If no strategies are added, Build() returns PassthroughPipeline.Instance.


Implement IResilienceStrategy to create strategies that are not built in:

public class RateLimitStrategy : IResilienceStrategy
{
    public int Order => 150; // Between Timeout (100) and Bulkhead (200)

    public async Task<TResult> ExecuteAsync<TResult>(
        Func<ResilienceContext, CancellationToken, Task<TResult>> next,
        ResilienceContext context,
        CancellationToken ct)
    {
        await AcquireTokenAsync(ct); // Your rate-limiter wait (implementation not shown)
        return await next(context, ct);
    }
}

var pipeline = new ResiliencePipelineBuilder()
    .AddStrategy(new RateLimitStrategy())
    .AddRetry()
    .Build();

The Order property determines where your strategy sits in the pipeline. Choose a value between the built-in constants (StrategyOrder.Timeout = 100, StrategyOrder.Bulkhead = 200, etc.) to position it correctly.


ResilienceContext carries metadata through the pipeline. Every strategy receives the same context instance.

var context = new ResilienceContext
{
    OperationName = "ChargeCustomer", // Required. Used in logging, metrics, tracing.
    OperationKey = "payment-api"      // Optional. Overrides OperationName for circuit key.
};

| Property | Type | Description |
| --- | --- | --- |
| OperationName | string (required) | Logical operation name. Used in logs, metrics, and activity names. |
| OperationKey | string? | Circuit breaker isolation key. Defaults to OperationName if not set. |
| AttemptNumber | int | Current retry attempt (0 = first attempt). Updated by RetryStrategy. |
| TotalElapsed | TimeSpan | Total elapsed time since the first attempt. Updated by the pipeline. |
| Properties | IDictionary<string, object> | Arbitrary key-value pairs for cross-strategy communication. |

Use Properties to pass data between custom strategies without coupling them. For example, a rate limit strategy could store the remaining token count for a logging strategy to read.


Resilience errors implement Pragmatic.Result.Error for integration with the Result pattern. Each error has a code, HTTP status mapping, and descriptive title.

| Error Record | Code | HTTP Status | When |
| --- | --- | --- | --- |
| TimeoutError | TIMEOUT | 504 Gateway Timeout | Operation exceeded timeout duration |
| RetryExhaustedError | RETRY_EXHAUSTED | 503 Service Unavailable | All retry attempts failed |
| CircuitBrokenError | CIRCUIT_BROKEN | 503 Service Unavailable | Circuit is open, requests rejected |
| BulkheadRejectedError | BULKHEAD_REJECTED | 429 Too Many Requests | Max concurrency exceeded |

Each strategy also throws a corresponding exception for pipeline-level control flow:

| Exception | Thrown By |
| --- | --- |
| TimeoutRejectedException | TimeoutStrategy when timeout elapses |
| RetryExhaustedException | RetryStrategy when all retries fail (inner exception = last failure) |
| CircuitBrokenException | CircuitBreakerStrategy when circuit is open |
| BulkheadRejectedException | BulkheadStrategy when capacity is full |

The error records are for mapping to Result<T, E> at the action/endpoint layer. The exceptions are for pipeline-level control flow between strategies.
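As a hedged sketch of that boundary conversion: an endpoint-level method might catch each pipeline exception and return the corresponding typed error. The error-record constructors and the exact conversion point shown here are assumptions, not the library's confirmed API:

```csharp
// Sketch only: convert pipeline exceptions into typed Result errors at the
// boundary. Constructor shapes for the error records are assumptions.
async Task<Result<PaymentResult, IError>> ChargeAsync(ChargeRequest request, CancellationToken ct)
{
    try
    {
        return await pipeline.ExecuteAsync(
            (ctx, token) => gateway.ChargeAsync(request, token),
            new ResilienceContext { OperationName = "ChargeCustomer" });
    }
    catch (TimeoutRejectedException)  { return new TimeoutError(); }          // -> 504
    catch (RetryExhaustedException)   { return new RetryExhaustedError(); }   // -> 503
    catch (CircuitBrokenException)    { return new CircuitBrokenError(); }    // -> 503
    catch (BulkheadRejectedException) { return new BulkheadRejectedError(); } // -> 429
}
```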


A key design decision: only exceptions trigger resilience strategies. Result<T, E> failures pass through the pipeline unchanged.

[DomainAction]
[ResiliencePolicy("external-api")]
public partial class FetchUserAction : DomainAction<UserDto>
{
    public override async Task<Result<UserDto, IError>> Execute(CancellationToken ct)
    {
        var user = await _userService.GetByIdAsync(Id, ct);

        // This Result failure passes through -- NOT retried.
        // "User not found" is a business outcome, not a transient failure.
        if (user is null)
            return new NotFoundError("User", Id);

        // This exception IS retried (if HttpRequestException matches ShouldRetry).
        var profile = await http.GetFromJsonAsync<Profile>($"/profiles/{user.ExternalId}", ct);
        return UserDto.FromEntity(user, profile);
    }
}

Why this matters:

  • Retrying a “not found” is wrong. The entity does not exist. Retrying will not make it appear.
  • Retrying a validation error is wrong. The input is invalid. Sending the same invalid input again will produce the same error.
  • Retrying an HttpRequestException is right. The server might have been temporarily unavailable.

This boundary is automatic. The SG wraps the Execute method — if it returns a Result.Failure, the pipeline sees a successful execution (no exception thrown) and passes the result through. If it throws, the pipeline applies the resilience strategies.


ActivitySource: "Pragmatic.Resilience"

Each pipeline execution creates an activity named Resilience.{policyName} (where policyName comes from OperationKey or defaults to "unknown"). Tags include:

  • policy.name — the resolved policy name
  • outcome — "success", "retry", or "exception"
  • attempt — retry attempt number (if retried)

Meter: "Pragmatic.Resilience"

| Instrument | Type | Name | Description |
| --- | --- | --- | --- |
| Pipeline duration | Histogram | pragmatic.resilience.duration | Execution duration in milliseconds |
| Pipeline executions | Counter | pragmatic.resilience.executions | Total pipeline executions |
| Retry attempts | Counter | pragmatic.resilience.retry_attempts | Total retry attempts |
| Circuit rejections | Counter | pragmatic.resilience.circuit_rejections | Requests rejected by open circuits |
| Timeouts | Counter | pragmatic.resilience.timeouts | Total timeout occurrences |
| Bulkhead rejections | Counter | pragmatic.resilience.bulkhead_rejections | Requests rejected by bulkhead |

All log messages use [LoggerMessage] source-generated partial methods for zero-allocation structured logging:

| Event | Level | Message |
| --- | --- | --- |
| Retry attempt | Warning | Retry attempt {N}/{Max} for {Op} after {Delay}ms. Error: {Msg} |
| Timeout | Warning | Operation {Op} timed out after {Timeout}ms |
| Retry exhausted | Error | All {Max} retry attempts exhausted for {Op}. Last error: {Msg} |
| Circuit rejected | Warning | Circuit '{Key}' rejected request -- circuit is open |
| Circuit opened | Warning | Circuit '{Key}' opened after {N} consecutive failures |
| Bulkhead rejected | Warning | Bulkhead rejected '{Op}' -- max concurrency {N} reached |
| Fallback used | Information | Fallback used for '{Op}'. Original error: {Msg} |

Annotate a DomainAction with [ResiliencePolicy("name")] to automatically wrap execution with the named pipeline. The source generator detects Pragmatic.Resilience in the project references, injects IResiliencePipelineProvider, and wraps Execute with the resolved pipeline.

[DomainAction]
[ResiliencePolicy("payment-gateway")]
public partial class ChargeCustomerAction : DomainAction<PaymentResult>
{
    public override async Task<Result<PaymentResult, IError>> Execute(CancellationToken ct)
    {
        // Wrapped by the "payment-gateway" pipeline
    }
}

Error records (TimeoutError, RetryExhaustedError, CircuitBrokenError, BulkheadRejectedError) extend Pragmatic.Result.Error with HTTP status code mappings. When an endpoint catches a resilience exception, it can convert it to the typed error for proper HTTP response mapping (504, 503, 429).

When Pragmatic.Resilience is referenced in the project, the source generator auto-registers AddPragmaticResilience() with default settings via FeatureDetector. No manual registration is needed for basic usage. For custom policies, call AddPragmaticResilience(options => ...) in your IStartupStep.
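As a hedged sketch, a custom policy registered in code might look like the following. The options shape mirrors the configuration section earlier in this guide, but the exact overload and property names are assumptions:

```csharp
// Sketch: registering a named policy at startup. The AddPragmaticResilience
// options shape shown here is an assumption based on the configuration schema.
services.AddPragmaticResilience(options =>
{
    options.Policies["payment-gateway"] = new ResiliencePolicyOptions
    {
        Timeout = new() { Timeout = TimeSpan.FromSeconds(10) },
        Retry = new()
        {
            MaxRetries = 3,
            BackoffType = BackoffType.Exponential,
            BaseDelay = TimeSpan.FromMilliseconds(200)
        },
        CircuitBreaker = new() { FailureThreshold = 5, BreakDuration = TimeSpan.FromSeconds(30) }
    };
});
```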