Architecture Overview
System architecture, deployment units, and dependency layers for Courier MFT.
Courier is a three-tier application deployed as three independent processes: a REST API host, a background Worker host, and a Next.js frontend. All three share a single PostgreSQL database and communicate indirectly through the database and Azure Key Vault — there is no inter-process messaging bus in V1. This polling-based coordination has a documented throughput ceiling (~50–100 jobs/hour, 3–10s pickup latency) that is acceptable for V1's target workload; Section 15 describes the migration to event-driven scheduling.
2.1 System Context
┌─────────────────────────────────────────────────────────────────────────┐
│ COURIER PLATFORM │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌────────────────────┐ │
│ │ Frontend │────►│ API Host │ │ Worker Host │ │
│ │ (Next.js) │◄────│ (ASP.NET Core) │ │ (.NET Worker) │ │
│ │ │ │ │ │ │ │
│ │ • Dashboard │HTTPS│ • REST API │ │ • Job Engine │ │
│ │ • Job Builder│ │ • Auth (Entra) │ │ • Quartz Scheduler│ │
│ │ • Monitor UI │ │ • Validation │ │ • File Monitors │ │
│ │ • Key Mgmt │ │ • OpenAPI/Swagger│ │ • Key Rotation │ │
│ │ • Audit Log │ │ │ │ • Partition Maint. │ │
│ └──────────────┘ └────────┬─────────┘ └─────────┬──────────┘ │
│ │ │ │
│ ┌───────▼──────────────────────────▼──────┐ │
│ │ PostgreSQL 16+ │ │
│ │ │ │
│ │ Jobs, Connections, Keys, Monitors, │ │
│ │ Executions, Audit Log, Quartz Tables │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
│ │ │
│ │ │
┌───────▼──────┐ ┌───────▼──────┐ ┌────────▼────────┐
│ Azure Entra │ │ Azure Key │ │ Partner SFTP/ │
│ ID (Auth) │ │ Vault │ │ FTP Servers │
└──────────────┘ └──────────────┘ └─────────────────┘
2.2 Deployment Units
| Unit | Technology | Responsibilities | Scaling |
|---|---|---|---|
| API Host | ASP.NET Core 10 | REST API, authentication, request validation, CRUD operations, OpenAPI spec | Horizontal — stateless, any number of replicas behind a load balancer |
| Worker Host | .NET 10 Worker Service | Quartz.NET scheduler, Job Engine execution, File Monitor polling, key rotation checks, partition maintenance | Single instance (V1) — Quartz AdoJobStore handles clustered failover if scaled later |
| Frontend | Next.js (standalone) | User interface, OAuth 2.0 Authorization Code + PKCE flow, API consumption | Horizontal — self-contained Node.js server in container |
Why separate API and Worker? The API host is request-driven and benefits from horizontal scaling. The Worker host is long-running and CPU/IO-bound (file transfers, encryption, compression). Separating them allows independent scaling, independent deployment, and prevents a runaway job from starving API response times. Both processes share the same domain logic via shared class libraries.
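As a concrete sketch of the shared-library arrangement, both hosts can call a single registration extension from Courier.Features. The extension and the placeholder service bodies below are illustrative, not taken from the actual codebase; only the service names come from this document:

```csharp
using Microsoft.Extensions.DependencyInjection;

// Placeholder types standing in for the real services in Courier.Features:
public sealed class JobService { }
public sealed class JobExecutionService { }
public sealed class ConnectionService { }

// Illustrative sketch: one registration extension keeps the two hosts'
// service graphs identical for the shared domain logic.
public static class FeatureRegistration
{
    public static IServiceCollection AddCourierFeatures(this IServiceCollection services)
    {
        // The API calls these services from controllers; the Worker calls the
        // same logic from Quartz-driven hosted services.
        services.AddScoped<JobService>();
        services.AddScoped<JobExecutionService>();
        services.AddScoped<ConnectionService>();
        return services;
    }
}

// Courier.Api/Program.cs and Courier.Worker/Program.cs both do:
//   builder.Services.AddCourierFeatures();
```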
2.3 Internal Architecture — Vertical Slices
Courier organizes code by feature, not by technical layer. Each feature folder contains everything needed for that domain: API controllers, request/response DTOs, validators, application services, domain entities, and infrastructure adapters.
Solution structure:
Courier.sln
│
├── src/
│ ├── Courier.Api/ ← API Host (ASP.NET Core)
│ │ ├── Program.cs ← Startup, middleware, DI
│ │ ├── Middleware/ ← Exception handler, auth, CORS
│ │ └── appsettings.json
│ │
│ ├── Courier.Worker/ ← Worker Host (.NET Worker Service)
│ │ ├── Program.cs ← Startup, hosted services, DI
│ │ └── appsettings.json
│ │
│ ├── Courier.Features/ ← Shared feature library (API + Worker)
│ │ ├── Jobs/
│ │ │ ├── Entities/ ← Job, JobStep, JobVersion, JobExecution, StepExecution
│ │ │ ├── Dtos/ ← JobDto, CreateJobRequest, JobFilter
│ │ │ ├── Validators/ ← CreateJobValidator, UpdateJobValidator
│ │ │ ├── Services/ ← JobService, JobExecutionService
│ │ │ ├── Controllers/ ← JobsController
│ │ │ ├── StepTypes/ ← IStepHandler implementations
│ │ │ └── Mapping/ ← EF Core entity configuration
│ │ │
│ │ ├── Chains/
│ │ │ ├── Entities/
│ │ │ ├── Dtos/
│ │ │ ├── Validators/
│ │ │ ├── Services/
│ │ │ ├── Controllers/
│ │ │ └── Mapping/
│ │ │
│ │ ├── Connections/
│ │ │ ├── Entities/ ← Connection, KnownHost
│ │ │ ├── Dtos/
│ │ │ ├── Validators/
│ │ │ ├── Services/ ← ConnectionService, ConnectionTester
│ │ │ ├── Controllers/
│ │ │ ├── Protocols/ ← SftpClient, FtpClient, FtpsClient adapters
│ │ │ └── Mapping/
│ │ │
│ │ ├── Keys/
│ │ │ ├── Pgp/ ← PgpKey entity, PGP services, controllers
│ │ │ ├── Ssh/ ← SshKey entity, SSH services, controllers
│ │ │ └── Shared/ ← ICryptoProvider, Key rotation service
│ │ │
│ │ ├── Monitors/
│ │ │ ├── Entities/ ← FileMonitor, MonitorJobBinding, MonitorFileLog
│ │ │ ├── Dtos/
│ │ │ ├── Validators/
│ │ │ ├── Services/ ← MonitorService, LocalWatcher, RemotePoller
│ │ │ ├── Controllers/
│ │ │ └── Mapping/
│ │ │
│ │ ├── Tags/
│ │ │ ├── Entities/ ← Tag, EntityTag
│ │ │ ├── Dtos/
│ │ │ ├── Services/
│ │ │ ├── Controllers/
│ │ │ └── Mapping/
│ │ │
│ │ ├── Audit/
│ │ │ ├── Entities/ ← AuditLogEntry, DomainEvent
│ │ │ ├── Dtos/
│ │ │ ├── Services/ ← AuditService
│ │ │ ├── Controllers/
│ │ │ └── Mapping/
│ │ │
│ │ ├── Dashboard/
│ │ │ ├── Dtos/ ← SummaryDto, RecentExecutionDto
│ │ │ ├── Services/ ← DashboardService
│ │ │ └── Controllers/
│ │ │
│ │ └── Settings/
│ │ ├── Entities/ ← SystemSetting
│ │ ├── Dtos/
│ │ ├── Services/
│ │ ├── Controllers/
│ │ └── Mapping/
│ │
│ ├── Courier.Domain/ ← Shared domain primitives
│ │ ├── Common/ ← ApiResponse<T>, ErrorCodes, enums
│ │ ├── ValueObjects/ ← FailurePolicy, StepConfiguration, etc.
│ │ └── Interfaces/ ← ITransferClient, ICryptoProvider, IStepHandler
│ │
│ ├── Courier.Infrastructure/ ← Cross-cutting infrastructure
│ │ ├── Persistence/ ← CourierDbContext, global filters, interceptors
│ │ ├── Encryption/ ← EnvelopeEncryptionService, KeyVaultClient
│ │ ├── Compression/ ← ZIP, GZIP, TAR, 7z providers
│ │ └── Migrations/ ← DbUp runner, embedded SQL scripts
│ │
│ └── Courier.Frontend/ ← Next.js project (separate build pipeline)
│ ├── src/
│ │ ├── app/ ← Next.js app router
│ │ ├── components/ ← Shared UI components
│ │ └── lib/ ← API client, auth, utilities
│ └── package.json
│
├── tests/
│ ├── Courier.Tests.Unit/
│ ├── Courier.Tests.Integration/
│ └── Courier.Tests.Architecture/
│
└── infra/ ← Deployment configs
├── docker/ ← Dockerfiles for API, Worker
├── k8s/ ← Kubernetes manifests
└── scripts/ ← CI/CD, seed data
Key principle: Feature folders own their entire vertical slice. If you need to understand how Jobs work, you open Courier.Features/Jobs/ — the entities, DTOs, validators, services, controllers, and EF mappings are all there. Cross-cutting concerns (database context, encryption, compression) live in Courier.Infrastructure and are injected via DI.
2.4 Dependency Rules
┌───────────────────────────────────────────┐
│ Courier.Api │ ← Thin host: startup, middleware
│ Courier.Worker │ ← Thin host: startup, hosted services
├───────────────────────────────────────────┤
│ Courier.Features │ ← All feature slices
├───────────────────────────────────────────┤
│ Courier.Infrastructure │ ← EF Core, encryption, compression
├───────────────────────────────────────────┤
│ Courier.Domain │ ← Entities, interfaces, value objects
└───────────────────────────────────────────┘
References flow downward only. No project references upward.
- Courier.Api → references Courier.Features, Courier.Infrastructure, Courier.Domain
- Courier.Worker → references Courier.Features, Courier.Infrastructure, Courier.Domain
- Courier.Features → references Courier.Infrastructure, Courier.Domain
- Courier.Infrastructure → references Courier.Domain
- Courier.Domain → references nothing (no NuGet dependencies except value types)
These rules are enforced by architecture tests in Courier.Tests.Architecture using NetArchTest or ArchUnitNET.
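A minimal example of such a rule, using NetArchTest (the namespaces match the projects above; the test shape is a sketch and assumes xUnit):

```csharp
using NetArchTest.Rules;
using Xunit;

public class DependencyRuleTests
{
    [Fact]
    public void Domain_HasNoUpwardReferences()
    {
        // Courier.Domain sits at the bottom of the layer diagram and must not
        // reference any project above it.
        var result = Types.InCurrentDomain()
            .That().ResideInNamespace("Courier.Domain")
            .ShouldNot().HaveDependencyOnAny(
                "Courier.Infrastructure", "Courier.Features",
                "Courier.Api", "Courier.Worker")
            .GetResult();

        Assert.True(result.IsSuccessful);
    }
}
```

One such test per layer pins the whole "references flow downward only" rule; a violation fails CI instead of surfacing as an accidental project reference.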
2.5 Request Flow (API)
A typical API request flows through the following pipeline:
Client Request (HTTPS)
│
▼
┌─────────────────────────┐
│ Kestrel / Reverse Proxy │
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ ApiExceptionMiddleware │ ← Catches unhandled exceptions → ApiResponse envelope
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ CORS Middleware │ ← Validates origin against Frontend:Origin
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ Authentication │ ← Validates Entra ID JWT bearer token
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ Authorization │ ← Checks [Authorize(Roles = "...")] attributes
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ Security Headers │ ← X-Content-Type-Options, CSP, HSTS, etc.
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ Serilog Request Logging │ ← Structured log with method, path, status, duration
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ Controller Action │ ← Route matched, model binding
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ FluentValidation Filter │ ← Validates request body → 400 if invalid
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ Application Service │ ← Business logic, domain operations
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ EF Core / DbContext │ ← Query or persist via PostgreSQL
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ ApiResponse<T> Envelope │ ← Wrap result in standard response model
└────────────┬────────────┘
▼
JSON Response
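In Program.cs terms, the pipeline above corresponds roughly to this registration sequence. This is a sketch: middleware class names not in the diagram (e.g. SecurityHeadersMiddleware) are assumptions, and service registration is elided.

```csharp
var builder = WebApplication.CreateBuilder(args);
// Service registration (auth, CORS policy, controllers, validators) elided.
var app = builder.Build();

// Registration order defines the pipeline: each middleware wraps the ones below.
app.UseMiddleware<ApiExceptionMiddleware>();    // outermost: unhandled exceptions
                                                // become ApiResponse envelopes
app.UseCors("Frontend");                        // origin check against Frontend:Origin
app.UseAuthentication();                        // Entra ID JWT bearer validation
app.UseAuthorization();                         // [Authorize(Roles = "...")] enforcement
app.UseMiddleware<SecurityHeadersMiddleware>(); // CSP, HSTS, X-Content-Type-Options
app.UseSerilogRequestLogging();                 // method, path, status, duration
app.MapControllers();                           // routing, model binding, validation filter

app.Run();
```

The exception middleware must be outermost so that failures in any later stage, including auth, still produce the standard envelope rather than a bare 500.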
2.6 Execution Flow (Worker)
A scheduled job execution flows through the Worker host:
Quartz.NET Trigger Fires
│
▼
┌─────────────────────────┐
│ QuartzJobAdapter │ ← Quartz IJob → resolves Courier's JobExecutionService via DI
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ JobExecutionService │ ← Creates JobExecution record, loads Job + Steps + Version
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ Step Loop │ ← For each Step in order:
│ │ │
│ │ ┌─────────────────┐ │
│ │ │ IStepHandler │ │ ← Resolved by typeKey from DI container
│ │ │ .ExecuteAsync() │ │
│ │ └────────┬────────┘ │
│ │ │ │
│ │ ┌────────▼────────┐ │
│ │ │ Transfer / │ │ ← ITransferClient, ICryptoProvider, ICompressionProvider
│ │ │ Encrypt / │ │
│ │ │ Compress │ │
│ │ └────────┬────────┘ │
│ │ │ │
│ │ StepExecution saved │ ← State, duration, bytes, output
│ │ JobContext updated │ ← Output variables for next step
│ │ │
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ JobExecution completed │ ← Final state: Completed / Failed
│ Temp directory cleaned │ ← Immediate cleanup on completion
│ Audit event logged │
│ Downstream triggers │ ← Dependent jobs / chain next member
└─────────────────────────┘
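The dispatch in the step loop implies a handler contract along these lines. IStepHandler and typeKey-based resolution are from this section; the member names, StepResult shape, and resolver are assumptions for illustration:

```csharp
// Placeholder for the per-execution context that carries output variables
// between steps (see "JobContext updated" above).
public sealed class JobContext { }

// Sketch of the step contract. Each concrete handler (SFTP upload, PGP encrypt,
// compress, ...) registers itself in DI and declares the typeKey it serves.
public interface IStepHandler
{
    string TypeKey { get; }   // e.g. "sftp.upload", "pgp.encrypt" (illustrative keys)
    Task<StepResult> ExecuteAsync(JobContext context, CancellationToken ct);
}

public sealed record StepResult(bool Succeeded, long BytesProcessed, string? Output);

// All handlers are registered as IStepHandler; the engine picks one by typeKey.
public sealed class StepHandlerResolver(IEnumerable<IStepHandler> handlers)
{
    public IStepHandler Resolve(string typeKey) =>
        handlers.FirstOrDefault(h => h.TypeKey == typeKey)
        ?? throw new InvalidOperationException($"Unknown step type '{typeKey}'");
}
```

Registering every handler under the same interface means adding a new step type is one class plus one DI registration; the engine's loop never changes.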
2.7 Data Flow Between API and Worker
The API and Worker hosts do not communicate directly. They coordinate through the database:
┌──────────────┐ ┌──────────────┐
│ API Host │ │ Worker Host │
│ │ │ │
│ POST /jobs/{id}/execute │ │
│ │ │ │
│ ▼ │ │
│ Insert job_executions │ │
│ row (state: 'queued') ──────────────────► │ Quartz polls│
│ PostgreSQL │ AdoJobStore │
│ Insert job_schedules ──────────────────► │ │
│ row (cron expression) │ Quartz picks│
│ │ up trigger │
│ │ │ │
│ GET /jobs/{id}/executions │ ▼ │
│ │ │ Execute job │
│ ▼ │ Update rows │
│ Read job_executions ◄──────────────────────│ (state, │
│ rows (state, results) PostgreSQL │ results) │
│ │ │
└──────────────┘ └──────────────┘
For manual execution (POST /api/v1/jobs/{id}/execute), the API host creates a job_executions record with state queued and schedules an immediate Quartz trigger. The Worker's Quartz scheduler picks up the trigger via its AdoJobStore poll interval and begins execution. The API host reads execution status by querying the same job_executions table.
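On the API side, "schedules an immediate Quartz trigger" can be sketched with Quartz.NET's standard API. TriggerJob and JobDataMap are real Quartz.NET; the job key name, schedulerFactory, and executionId variables are assumptions:

```csharp
// After inserting the job_executions row (state 'queued'), ask Quartz to fire
// the runner job immediately. Because both hosts point at the same AdoJobStore
// tables, the Worker's scheduler sees the trigger on its next poll (default 5s)
// even though the API never talks to the Worker directly.
IScheduler scheduler = await schedulerFactory.GetScheduler();

var data = new JobDataMap
{
    { "jobExecutionId", executionId.ToString() }  // which queued row to run
};

await scheduler.TriggerJob(new JobKey("courier-job-runner"), data);
```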
Throughput ceiling and known limitations:
Database-as-bus is a deliberate V1 tradeoff: zero additional infrastructure, simple debugging (query the tables), and transactional consistency. It has predictable limits that should inform capacity planning:
| Metric | V1 Design Point | Bottleneck |
|---|---|---|
| Job throughput | ~50–100 jobs/hour | Concurrency limit (default 5) × avg job duration. Most file transfer jobs run 10–120 seconds. |
| Queue poll latency | 3–10 seconds (p95) | Quartz AdoJobStore poll interval (default 5s) + queue dequeue poll (default 5s). Worst case on a fresh trigger is the sum of both intervals. |
| File Monitor throughput | ~200 files/minute (local), ~30 files/minute (remote per connection) | Local: FileSystemWatcher + stability window. Remote: poll interval × connection overhead. |
| Typical file size | 1 KB – 500 MB | Streaming architecture handles large files. Memory usage scales with step buffer size (default 80 KB), not file size. |
| Concurrent polling load | Quartz (1 query/5s) + queue dequeue (1 query/5s) + N monitors (1 query/interval each) | Single Worker instance: ~2–10 queries/second to PostgreSQL. Negligible for a dedicated database. |
Known failure modes at scale:
- Poll jitter under load: When the database is under heavy write load (e.g., bulk audit logging during many concurrent jobs), poll queries experience variable latency. This manifests as inconsistent job pickup times. Mitigation: Quartz and queue polls use dedicated read connections, not the same connection pool as writes.
- Thundering herd on restart: If the Worker restarts with a backlog of queued jobs, Quartz fires all pending triggers simultaneously, exceeding the concurrency limit. Mitigation: the concurrency semaphore (Section 5.8) gates actual execution — excess triggers enter the Queued state and wait.
- No backpressure from API to Worker: The API can queue jobs faster than the Worker can execute them. There is no feedback mechanism to slow down callers. Mitigation: the API returns the queue position in the response, and the dashboard shows queue depth. Alert when queue depth exceeds a configurable threshold.
- "Exactly once" is not guaranteed: Database polling provides at-least-once pickup semantics. FOR UPDATE SKIP LOCKED (Section 5.8) prevents duplicate pickup in the steady state, but crash recovery could re-execute a job that was in progress. Mitigation: jobs are designed to be re-runnable (overwrite semantics on upload), and the job_executions table records the outcome of each attempt.
These limits are acceptable for V1's target workload (internal file transfer operations, not high-frequency event processing). Section 15 documents the V2 migration to event-driven scheduling that removes the polling bottleneck.
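The at-least-once pickup rests on a claim query of roughly this shape, sketched here with Npgsql. Column names such as started_at and created_at are assumptions, and the authoritative SQL lives in Section 5.8:

```csharp
// Atomically claim one queued execution. FOR UPDATE SKIP LOCKED makes a
// competing poller skip rows this transaction has already locked, so two
// pollers never claim the same execution in the steady state.
const string DequeueSql = """
    UPDATE job_executions
       SET state = 'running', started_at = now()
     WHERE id = (
         SELECT id
           FROM job_executions
          WHERE state = 'queued'
          ORDER BY created_at
          LIMIT 1
            FOR UPDATE SKIP LOCKED)
    RETURNING id;
    """;

await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync(ct);
await using var cmd = new NpgsqlCommand(DequeueSql, conn);
object? claimed = await cmd.ExecuteScalarAsync(ct);  // null => queue was empty
```

Note what this does not give you: a row claimed by a Worker that then crashes stays in 'running' until startup recovery marks it Failed, which is exactly the crash-recovery caveat above.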
2.8 External Integration Points
| External System | Direction | Protocol | Purpose | Section |
|---|---|---|---|---|
| Azure Entra ID | Inbound | OAuth 2.0 / OIDC | User authentication, role claims | 12.1 |
| Azure Key Vault | Outbound | HTTPS (REST) | Master key wrap/unwrap, application secrets | 7, 12 |
| Azure Blob Storage | Outbound | HTTPS (REST) | Archived partition data (cold storage) | 13.6 |
| Azure Application Insights | Outbound | HTTPS | Telemetry, tracing, alerting (prod) | 3.10 |
| Seq | Outbound | HTTP | Structured log search (dev only) | 3.10 |
| Azure Function Apps | Outbound | HTTPS (Admin API) | Trigger serverless functions as job steps; poll completion via App Insights | 5.2, 6.1 |
| Azure Log Analytics | Outbound | HTTPS (REST) | Query Application Insights for function execution status and traces | 5.2 |
| Partner SFTP servers | Outbound | SFTP (SSH) | File transfer — upload, download, directory listing | 6 |
| Partner FTP/FTPS servers | Outbound | FTP / FTPS | File transfer — upload, download, directory listing | 6 |
| PostgreSQL | Both | TCP (SSL) | Primary data store | 13 |
2.9 Cross-Cutting Concerns
| Concern | Implementation | Owner |
|---|---|---|
| Authentication | Entra ID JWT validation via Microsoft.Identity.Web | API Host middleware |
| Authorization | Role-based [Authorize] attributes (Admin, Operator, Viewer) | API Host controllers |
| Logging | Serilog → Seq (dev) / App Insights (prod) with sensitive data redaction | Both hosts |
| Audit | AuditService writes to audit_log_entries on every state change | Both hosts |
| Encryption | EnvelopeEncryptionService wrapping Key Vault + AES-256-GCM | Both hosts |
| FIPS compliance | Algorithm restrictions + validated module detection (Section 12.10) | Both hosts |
| Error handling | ApiExceptionMiddleware (API), try/catch in hosted services (Worker) | Per host |
| Health checks | .NET Aspire health endpoints (DB, Key Vault, Quartz, disk space) | Both hosts |
| Configuration | Key Vault (prod) + User Secrets (dev) via .NET Configuration | Both hosts |
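The encryption row above (envelope encryption over Key Vault plus AES-256-GCM) can be sketched with the Azure SDK. The vault URL, key name, and persistence layout are illustrative, and a plaintext byte[] is assumed to be in scope:

```csharp
using System.Security.Cryptography;
using Azure.Identity;
using Azure.Security.KeyVault.Keys.Cryptography;

// Envelope encryption sketch: a fresh data-encryption key (DEK) encrypts the
// payload locally; Key Vault wraps the DEK under the master key (KEK), so the
// KEK never leaves the vault and no plaintext key material is persisted.
var crypto = new CryptographyClient(
    new Uri("https://example-vault.vault.azure.net/keys/courier-master"), // illustrative
    new DefaultAzureCredential());

byte[] dek = RandomNumberGenerator.GetBytes(32);   // AES-256 key
byte[] nonce = RandomNumberGenerator.GetBytes(12); // GCM nonce, unique per message
byte[] ciphertext = new byte[plaintext.Length];
byte[] tag = new byte[16];

using (var aes = new AesGcm(dek, tagSizeInBytes: 16))
    aes.Encrypt(nonce, plaintext, ciphertext, tag);

WrapResult wrapped = await crypto.WrapKeyAsync(KeyWrapAlgorithm.RsaOaep256, dek);
CryptographicOperations.ZeroMemory(dek);           // drop the plaintext DEK promptly

// Persist wrapped.EncryptedKey + nonce + tag + ciphertext; on read,
// UnwrapKeyAsync recovers the DEK for decryption.
```

The Key Vault round trip (~20ms, per Section 2.11) is paid once per wrap/unwrap, not per byte, which is why the bulk work stays local under AES-GCM.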
2.10 Architecture Decision Records
Key architectural decisions and their rationale, for future reference:
| Decision | Chosen | Alternatives Considered | Rationale |
|---|---|---|---|
| Three deployables | API + Worker + Frontend | Monolith, two deployables | Independent scaling; CPU-bound jobs don't starve API; independent deploy cycles |
| Vertical slices | Feature folders | Layered architecture | Cohesion by feature; easier to navigate; each slice owns its full stack |
| Database as coordination | PostgreSQL polling | RabbitMQ, Azure Service Bus | V1 simplicity; no additional infrastructure; Quartz already polls. Ceiling: ~50–100 jobs/hour, 3–10s pickup latency. See Section 2.7 for throughput limits. |
| Schema migrations | DbUp with raw SQL (EF Core used for queries only) | EF Core migrations | Raw SQL gives full control over partitioning, triggers, indexes; DbUp is simpler for teams with strong SQL skills |
| Single PostgreSQL instance | One database, all tables | Separate databases per concern | Simpler operations; transactional consistency; partitioning handles scale |
| Quartz.NET for scheduling | AdoJobStore | Hangfire, custom timer | Mature, persistent, cron support, clustered failover, battle-tested |
| BouncyCastle for PGP | FIPS-approved algorithms only | GnuPG CLI, custom PGP | Only .NET library with full PGP format support; FIPS algorithms enforced in config |
| Azure Key Vault for KEK | Envelope encryption | Local key file, AWS KMS | Azure-native; FIPS 140-2 Level 2/3; hardware-backed; no key material on disk |
| System.Text.Json primary | Newtonsoft for Quartz only | Newtonsoft everywhere | Performance; .NET native; smaller dependency surface; Quartz requires Newtonsoft |
| No inter-process messaging (V1) | Database polling + FOR UPDATE SKIP LOCKED | RabbitMQ, gRPC, SignalR | V1 simplicity; poll jitter and DB load are acceptable at target throughput. V2 migrates to event-driven scheduling via outbox + message bus (Section 15). |
2.11 Non-Functional Requirements & Design Targets
Without explicit targets, claims like "polling is fine" and "partition monthly" cannot be evaluated. These are the design-point assumptions for V1. They are not SLAs — they are the workload profile the architecture was designed to support. Exceeding them requires the V2 changes documented in Section 15.
Throughput & capacity:
| Metric | V1 Design Target | Notes |
|---|---|---|
| Max file size (per step) | 10 GB | Streaming architecture; memory bounded to ~2× buffer size (default 80 KB). Tested to 10 GB; larger files should work but are not validated. |
| Concurrent job executions | 5 (configurable up to 20) | Global semaphore, not per-job. Bounded by Worker CPU/memory and IOPS. |
| Job throughput | ~50–100 jobs/hour | Depends on avg job duration. Bottleneck is concurrency limit × job runtime. |
| File Monitor throughput | ~200 files/min (local), ~30 files/min (remote) | Local: limited by FileSystemWatcher + stability window. Remote: limited by poll interval + connection overhead. |
| Concurrent active monitors | 50 | Beyond this, poll scheduling contention and database load from directory state become measurable. |
| Concurrent transfers per connection | 1 | SSH.NET and FluentFTP connections are not shared across jobs. Each job opens its own connection. |
| PGP/SSH keys stored | ~500 | No hard limit; performance degrades on key list queries if tags are heavily used and unindexed. |
| Audit log write rate | ~10–50 entries/second sustained | Partitioned by month; insert performance is stable. Querying across partitions degrades beyond 12 months of retained data. |
Latency:
| Metric | V1 Design Target | Notes |
|---|---|---|
| Job pickup latency (queued → running) | 3–10 seconds (p95) | Sum of Quartz poll interval + queue dequeue poll. |
| API response time (p95) | < 200ms | For CRUD operations. Excludes connection test (network-bound) and key generation (CPU-bound). |
| File Monitor detection (local, new file) | < 10 seconds | Watcher provides ~instant detection; stability window adds 5s before trigger. |
| File Monitor detection (remote, new file) | 1–2× poll interval | Depends on configured interval (min 30s, recommended 60s+). |
| Key Vault wrap/unwrap latency | ~20ms per operation | Azure Key Vault REST call. Adds to every encrypt/decrypt operation. |
Retention & storage:
| Metric | V1 Design Target | Notes |
|---|---|---|
| Audit log retention | 12 months online | Monthly partitions. Older partitions archived to Azure Blob (cold storage). |
| Job execution history | 12 months online | Same partitioning strategy as audit log. |
| Temp directory retention (orphaned) | 7 days | Background cleanup service purges. |
| Database size (1 year, typical) | ~10–50 GB | Depends heavily on audit log volume and JSONB column sizes. |
Availability & recovery:
| Metric | V1 Design Target | Notes |
|---|---|---|
| RPO (Recovery Point Objective) | < 1 hour | Azure Database for PostgreSQL continuous backup with PITR. RPO depends on WAL archival frequency. |
| RTO (Recovery Time Objective) | < 30 minutes | Container restart + migration check + Quartz re-acquisition. Does not include database restore time if the DB itself is lost. |
| Planned downtime tolerance | < 5 minutes | Rolling deployment: API hosts can be cycled independently. Worker requires brief stop for Quartz trigger handoff. |
| Unplanned Worker crash | Jobs in Running state are marked Failed on next startup. Queued jobs are re-picked up automatically. | No automatic failover to a second Worker in V1. |
What these targets do NOT cover (V2):
- Multi-region or active-active deployment
- Sub-second job pickup latency (requires event-driven scheduling)
- Horizontal Worker scaling (requires Quartz cluster mode + event bus)
- Zero-downtime database migrations
- Formal SLA commitments with contractual penalties