File Operations
File copy, move, compression, and transformation operations within job pipelines.
This section covers two subsystems: the compression/decompression engine used by file.zip and file.unzip job steps, and the File Monitor system that watches directories for changes and triggers job executions.
8.1 Compression & Decompression
8.1.1 Unified Interface
All compression operations go through a common provider interface, resolved by format at runtime. This follows the same plugin pattern as the Job Step registry.
```csharp
public interface ICompressionProvider
{
    string FormatKey { get; }               // e.g., "zip", "tar.gz", "7z"

    Task CompressAsync(
        CompressRequest request,
        IProgress<CompressionProgress> progress,
        CancellationToken cancellationToken);

    Task DecompressAsync(
        DecompressRequest request,
        IProgress<CompressionProgress> progress,
        CancellationToken cancellationToken);

    Task<ArchiveContents> InspectAsync(     // List contents without extracting
        string archivePath,
        CancellationToken cancellationToken);
}
```
```csharp
public record CompressRequest(
    IReadOnlyList<string> SourcePaths,   // Files or glob patterns
    string OutputPath,                   // Destination archive path
    string? Password,                    // Optional AES-256 encryption
    SplitArchiveConfig? SplitConfig);    // Optional split into chunks

public record DecompressRequest(
    string ArchivePath,                  // Source archive (or first split part)
    string OutputDirectory,              // Destination for extracted files
    string? Password);                   // Optional decryption password

public record CompressionProgress(
    long BytesProcessed,
    long TotalBytes,
    string CurrentFile);
```
A CompressionProviderRegistry resolves the correct provider by format key. The file.zip and file.unzip job steps delegate entirely to this registry, keeping step logic thin.
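As a sketch, the registry can be a dictionary keyed by format. The CompressionProviderRegistry name comes from this section; the constructor shape and error handling below are illustrative assumptions:

```csharp
// Illustrative registry resolving providers by format key. Keys are
// compared case-insensitively, so "ZIP" and "zip" resolve identically.
public sealed class CompressionProviderRegistry
{
    private readonly Dictionary<string, ICompressionProvider> _providers;

    public CompressionProviderRegistry(IEnumerable<ICompressionProvider> providers) =>
        _providers = providers.ToDictionary(
            p => p.FormatKey, StringComparer.OrdinalIgnoreCase);

    public ICompressionProvider Resolve(string formatKey) =>
        _providers.TryGetValue(formatKey, out var provider)
            ? provider
            : throw new NotSupportedException($"No provider for format '{formatKey}'");
}
```

The file.zip step would then call Resolve("zip") and delegate to CompressAsync, keeping the step itself format-agnostic.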
8.1.2 Supported Formats
| Format | Compress | Decompress | Password Support | Library |
|---|---|---|---|---|
| ZIP | Yes | Yes | AES-256 | SharpZipLib |
| GZIP | Yes | Yes | No | SharpZipLib |
| TAR | Yes | Yes | No | SharpZipLib |
| TAR.GZ | Yes | Yes | No | SharpZipLib |
| 7z | Yes | Yes | Yes | 7z CLI via Process wrapper |
| RAR | No | Yes | Yes | SharpCompress |
RAR creation is not supported because the RAR format is proprietary and its licensing prohibits archive creation in third-party software. Extraction is supported via SharpCompress, which implements the decompression algorithm under license.
For 7z, the most reliable cross-platform approach is wrapping the 7z CLI binary (bundled in the Docker image) via a managed Process call. Native .NET 7z libraries exist but have inconsistent support for advanced features like solid archives and multi-threaded compression. The CLI wrapper provides full feature parity and has been battle-tested.
7z CLI security hardening:
Invoking an external process from a server application is inherently risky. The following protections are mandatory:
No shell invocation: The wrapper uses ProcessStartInfo with UseShellExecute = false and passes arguments as individual elements, never through shell interpretation. This eliminates command injection via filenames or passwords.
```csharp
var psi = new ProcessStartInfo
{
    FileName = "/usr/bin/7z",        // Absolute path — no PATH lookup
    UseShellExecute = false,         // No shell — exec() directly
    CreateNoWindow = true,
    RedirectStandardOutput = true,
    RedirectStandardError = true,
    Environment =                    // Minimal environment
    {
        ["PATH"] = "/usr/bin",
        ["LANG"] = "C.UTF-8"
    }
};

// Arguments are added individually — never string-interpolated
psi.ArgumentList.Add("a");                   // command: add
psi.ArgumentList.Add("-t7z");                // format
psi.ArgumentList.Add($"-p{password}");       // password (if any)
psi.ArgumentList.Add("-mhe=on");             // encrypt headers
psi.ArgumentList.Add(sanitizedOutputPath);   // output archive
psi.ArgumentList.Add(sanitizedInputPath);    // input file(s)
```
Filename sanitization: All filenames are validated before being passed to the CLI. The validation rejects: path traversal sequences (.., ./), absolute paths outside the job's temp directory, null bytes, control characters, and shell metacharacters (`, $, |, ;, &, >, <). Filenames that fail validation cause the step to fail with error 7001: Unsafe filename in archive operation.
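A minimal sketch of this validation, assuming the rule set above. The method name is illustrative, and the absolute-path containment check is handled separately by the sandbox validation described below:

```csharp
// Rejects the character classes listed above. Returning false maps to
// error 7001: Unsafe filename in archive operation.
private static readonly char[] ShellMetacharacters = { '`', '$', '|', ';', '&', '>', '<' };

private static bool IsSafeFilename(string name)
{
    if (name.Contains("..") || name.StartsWith("./")) return false;   // traversal
    if (name.Contains('\0')) return false;                            // null byte
    if (name.Any(char.IsControl)) return false;                       // control chars
    if (name.IndexOfAny(ShellMetacharacters) >= 0) return false;      // shell metachars
    return true;
}
```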
Temp directory isolation: All 7z operations execute within the job's unique temp directory (/data/courier/temp/{executionId}/). The wrapper sets ProcessStartInfo.WorkingDirectory to this path and validates that all input and output paths resolve within it (after symlink resolution). This prevents a malicious archive from extracting files outside the sandbox.
```csharp
private string ValidatePathWithinSandbox(string path, string sandboxDir)
{
    // Append a trailing separator before the prefix check so a sibling
    // directory such as /data/courier/temp/abc-evil cannot pass as being
    // inside /data/courier/temp/abc.
    var sandbox = Path.GetFullPath(sandboxDir)
        .TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
    var resolved = Path.GetFullPath(path);
    if (!resolved.StartsWith(sandbox, StringComparison.Ordinal))
        throw new SecurityException(
            $"Path escapes sandbox: {path} resolves to {resolved}");
    return resolved;
}
```
Path traversal on extraction (Zip Slip): When extracting archives, the wrapper validates every entry path before writing. Archive entries with names containing .. or absolute paths are rejected. This prevents the classic Zip Slip attack where a malicious archive extracts files to arbitrary locations.
```csharp
// After extraction, verify no file escaped the sandbox. The trailing
// separator guards against sibling-directory prefix matches.
var sandbox = Path.GetFullPath(outputDir)
    .TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
var extractedFiles = Directory.EnumerateFiles(outputDir, "*", SearchOption.AllDirectories);
foreach (var file in extractedFiles)
{
    var resolved = Path.GetFullPath(file);
    if (!resolved.StartsWith(sandbox, StringComparison.Ordinal))
    {
        // This should never happen if 7z respects paths, but defense in depth
        File.Delete(resolved);
        throw new SecurityException($"Extracted file escaped sandbox: {file}");
    }
}
```
Hard timeouts: The Process is given a hard timeout matching the job step's timeout_seconds configuration (default: 300 seconds). If the process does not exit within the timeout, it is killed via Process.Kill(entireProcessTree: true). The step transitions to TimedOut state.
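A sketch of the timeout enforcement, assuming Process.WaitForExitAsync (available since .NET 5) and Kill(entireProcessTree: true) (since .NET Core 3.0); StepTimeoutException is a hypothetical name for whatever exception drives the TimedOut transition:

```csharp
// Enforce the step's timeout_seconds as a hard wall-clock limit.
using var process = Process.Start(psi)!;
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(timeoutSeconds));
try
{
    await process.WaitForExitAsync(cts.Token);
}
catch (OperationCanceledException)
{
    process.Kill(entireProcessTree: true);   // reap 7z and any child processes
    throw new StepTimeoutException(          // assumed exception type
        $"7z did not exit within {timeoutSeconds}s");
}
```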
Resource limits: The 7z process inherits the container's cgroup resource limits (CPU and memory), which are set in the Docker deployment (Section 14.2). No additional per-process limits are applied in V1 — the container limits are the boundary. If 7z exceeds the container's memory limit, the OOM killer terminates it, and the step fails with a descriptive error.
Stdout/stderr capture and sanitization: The wrapper captures stdout and stderr for progress tracking and error reporting. Before logging, the output is sanitized to remove any content that might contain sensitive data (passwords are passed via arguments, but some 7z errors echo command-line context). Stderr is truncated to 4KB before being stored in the step's audit record.
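The truncation and scrubbing step might look like the following sketch. The method name is illustrative; note that the byte-level UTF-8 truncation can split a multi-byte character at the boundary, which is acceptable for audit text:

```csharp
// Scrub any echoed password, then cap stderr at 4 KB before it is
// stored in the step's audit record.
private static string SanitizeStderr(string stderr, string? password)
{
    if (password is not null)
        stderr = stderr.Replace(password, "***");
    const int maxBytes = 4096;
    var bytes = System.Text.Encoding.UTF8.GetBytes(stderr);
    return bytes.Length <= maxBytes
        ? stderr
        : System.Text.Encoding.UTF8.GetString(bytes, 0, maxBytes) + "[truncated]";
}
```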
Binary integrity: The 7z binary is installed via apt-get install p7zip-full in the Dockerfile from the distro's official package repository. The binary path is hardcoded to /usr/bin/7z — no PATH lookup, no user-configurable binary location.
8.1.3 Streaming Architecture
All compression operations use Stream-based pipelines. Files are never loaded fully into memory. This is critical for the 6–10 GB files Courier must handle.
For compression, source files are read through a FileStream with a configurable buffer size (default: 81920 bytes / 80KB) and fed into the compression stream. For decompression, the archive stream is read and entries are extracted directly to FileStream output targets.
The pipeline for a typical compress operation:
FileStream (source) → CompressionStream (e.g., ZipOutputStream) → FileStream (output archive)
Memory usage stays bounded regardless of file size. The buffer size is configurable per step for tuning throughput vs. memory on constrained environments.
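A sketch of the ZIP variant of this pipeline using SharpZipLib's ZipOutputStream (the library and stream type are named above; entry naming and error handling are simplified):

```csharp
using ICSharpCode.SharpZipLib.Zip;

// Bounded-memory compression: data flows source → zip → output archive
// in 80 KB chunks; no file is ever fully resident in memory.
const int BufferSize = 81920;
using var output = new FileStream(archivePath, FileMode.Create,
    FileAccess.Write, FileShare.None, BufferSize);
using var zip = new ZipOutputStream(output);
foreach (var sourcePath in sourcePaths)
{
    zip.PutNextEntry(new ZipEntry(Path.GetFileName(sourcePath)));
    using var source = new FileStream(sourcePath, FileMode.Open,
        FileAccess.Read, FileShare.Read, BufferSize);
    source.CopyTo(zip, BufferSize);   // streams chunk by chunk
    zip.CloseEntry();
}
zip.Finish();
```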
8.1.4 Progress Reporting
Compression steps report progress back to the Job Engine via the IProgress<CompressionProgress> callback. The engine uses this to:
- Update the job_audit_log with bytes-processed metrics
- Provide real-time progress data for the eventual V2 UI progress bars
- Detect stalls — if no progress is reported within the step's timeout window, the step is timed out
Progress is reported at the file level (after each file is compressed/extracted) and at the byte level within large individual files (every 10MB processed).
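The byte-level throttle can be a simple counter, as in this sketch. ReportEvery and the callback shape are assumptions; only the 10 MB interval comes from this section:

```csharp
// Report at most once per 10 MB of processed data within a large file.
private const long ReportEvery = 10 * 1024 * 1024;
private long _sinceLastReport;

private void OnBytesProcessed(long count, long processed, long total,
    string currentFile, IProgress<CompressionProgress> progress)
{
    _sinceLastReport += count;
    if (_sinceLastReport >= ReportEvery)
    {
        _sinceLastReport = 0;
        progress.Report(new CompressionProgress(processed, total, currentFile));
    }
}
```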
8.1.5 Multi-File Handling
Compression steps accept multiple input sources via:
- Explicit file list: An array of absolute paths in the step configuration
- Glob patterns: Patterns like *.csv or invoice_2026*.pdf resolved against a base directory
- JobContext references: Upstream step outputs (e.g., "0.downloaded_files") that resolve to one or more file paths
All matched files are included in the output archive. If no files match, the step fails with a descriptive error rather than creating an empty archive.
Decompression extracts all entries by default. A future enhancement could support selective extraction via filename patterns.
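One plausible implementation of the glob resolution uses Microsoft.Extensions.FileSystemGlobbing; the source does not name the library actually used, so treat this as a sketch:

```csharp
using Microsoft.Extensions.FileSystemGlobbing;
using Microsoft.Extensions.FileSystemGlobbing.Abstractions;

// Resolve patterns like "*.csv" against a base directory, failing loudly
// when nothing matches rather than producing an empty archive.
IReadOnlyList<string> ResolveGlobs(string baseDirectory, IEnumerable<string> patterns)
{
    var matcher = new Matcher();
    foreach (var pattern in patterns)
        matcher.AddInclude(pattern);

    var result = matcher.Execute(
        new DirectoryInfoWrapper(new DirectoryInfo(baseDirectory)));
    if (!result.HasMatches)
        throw new InvalidOperationException(
            $"No files matched the configured patterns under {baseDirectory}");

    return result.Files
        .Select(f => Path.Combine(baseDirectory, f.Path))
        .ToList();
}
```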
8.1.6 Split Archives
For ZIP archives, Courier supports splitting output into multiple parts when the total size would exceed a configurable threshold. This is opt-in per step configuration:
```json
{
  "split_config": {
    "enabled": true,
    "max_part_size_mb": 500
  }
}
```
When enabled, the output is written as archive.zip, archive.z01, archive.z02, etc. Decompression of split archives is handled transparently — the step configuration points to the first part and SharpZipLib reassembles automatically.
Split archives are most useful when downstream systems have file size limits (e.g., email attachments in V2, or partner SFTP servers with quota restrictions).
8.1.7 Temp File Strategy
Compression operations write output to the job execution's temp directory first (/data/courier/temp/{executionId}/). Once the archive is fully written and validated, it is atomically moved (or copied, if crossing filesystem boundaries) to its configured destination path.
This prevents partial archives from appearing at the destination if the process is interrupted mid-compression. The temp directory lifecycle is managed by the Job Engine as described in Section 5.14.
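A sketch of the publish step (method name illustrative). File.Move is an atomic rename only within one filesystem; the cross-filesystem fallback is a plain copy, so a hardened version might copy to a temporary name beside the destination and rename:

```csharp
// Move the validated archive from the temp directory to its destination.
private static void PublishArchive(string tempPath, string destinationPath)
{
    try
    {
        File.Move(tempPath, destinationPath, overwrite: true);   // atomic rename
    }
    catch (IOException)   // e.g., temp and destination on different mounts
    {
        File.Copy(tempPath, destinationPath, overwrite: true);
        File.Delete(tempPath);
    }
}
```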
8.1.8 Archive Extraction Safety
Extracting untrusted archives is a well-documented attack surface. Courier treats every archive as potentially hostile and enforces the following protections during decompression:
Zip Slip (path traversal): Archive entries with names containing .., absolute paths, or paths that resolve outside the extraction sandbox are rejected before writing. This is checked both pre-extraction (by inspecting the entry name) and post-extraction (by resolving the written file's actual path via Path.GetFullPath() and comparing against the sandbox). See Section 8.1.2 for the 7z CLI-specific implementation; SharpZipLib and SharpCompress extractions use the same ValidatePathWithinSandbox() check.
Symlink and hardlink attacks: Archive formats (TAR, 7z) can contain symlink entries that point outside the extraction directory. On extraction, if a symlink target resolves outside the sandbox, the entry is skipped and logged as a security warning. Hardlinks to files outside the sandbox are similarly rejected. After extraction, a sweep verifies no symlinks escaped.
```csharp
private void ValidateExtractedEntry(string entryPath, string sandboxDir)
{
    // Trailing separator prevents sibling-directory prefix matches.
    var sandbox = Path.GetFullPath(sandboxDir)
        .TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
    var resolved = Path.GetFullPath(entryPath);
    if (!resolved.StartsWith(sandbox, StringComparison.Ordinal))
        throw new SecurityException($"Archive entry escapes sandbox: {entryPath}");

    // Check for symlinks pointing outside the sandbox (LinkTarget: .NET 6+)
    var fileInfo = new FileInfo(resolved);
    if (fileInfo.LinkTarget is not null)
    {
        var linkTarget = Path.GetFullPath(
            Path.Combine(Path.GetDirectoryName(resolved)!, fileInfo.LinkTarget));
        if (!linkTarget.StartsWith(sandbox, StringComparison.Ordinal))
        {
            File.Delete(resolved);
            throw new SecurityException(
                $"Symlink escapes sandbox: {entryPath} → {fileInfo.LinkTarget}");
        }
    }
}
```
Decompression bomb protection: A malicious archive can contain a small compressed file that expands to terabytes (e.g., a "zip bomb"). Courier enforces limits during extraction:
| Limit | Default | System Setting | Purpose |
|---|---|---|---|
| Max total uncompressed bytes | 20 GB | archive.max_uncompressed_bytes | Prevents disk exhaustion. Set slightly above the largest expected legitimate archive (2× the max file size target). |
| Max file count | 10,000 | archive.max_file_count | Prevents inode exhaustion and excessive processing time. |
| Max compression ratio | 200:1 | archive.max_compression_ratio | A 1 MB archive expanding to 200 MB is normal; expanding to 200 GB is a bomb. Ratio is checked per-entry and across the entire archive. |
| Max nesting depth | 0 (no nested extraction) | archive.max_nesting_depth | Nested archives (a .zip inside a .zip) are not extracted recursively in V1. The inner archive is treated as a regular file. |
| Max single entry size | 10 GB | archive.max_entry_size | Prevents a single entry from consuming all available disk. |
Limits are checked during extraction, not just after. The extraction stream tracks cumulative bytes written and entry count, aborting immediately when any limit is exceeded:
```csharp
private void CheckExtractionLimits(
    long bytesWrittenSoFar, int filesExtractedSoFar,
    long entryCompressedSize, long entryUncompressedSize)
{
    if (bytesWrittenSoFar > _maxUncompressedBytes)
        throw new ArchiveSafetyException(
            $"Total uncompressed size exceeds limit ({_maxUncompressedBytes} bytes)");

    if (filesExtractedSoFar > _maxFileCount)
        throw new ArchiveSafetyException(
            $"File count exceeds limit ({_maxFileCount})");

    if (entryCompressedSize > 0)
    {
        var ratio = (double)entryUncompressedSize / entryCompressedSize;
        if (ratio > _maxCompressionRatio)
            throw new ArchiveSafetyException(
                $"Compression ratio {ratio:F1}:1 exceeds limit ({_maxCompressionRatio}:1)");
    }
}
```
8.1.9 Archive Integrity Verification
After creating an archive, the engine optionally runs a validation pass to confirm integrity. This is enabled by default and can be disabled per step:
- ZIP/7z: Open the archive and verify all entries can be read without errors. For password-protected archives, verify decryption succeeds.
- TAR/TAR.GZ/GZIP: Verify the archive can be fully read without stream corruption.
- RAR: Not applicable (extraction-only).
Validation adds overhead proportional to archive size but catches corruption from disk errors or interrupted writes before the file is sent downstream.
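For ZIP, SharpZipLib's ZipFile.TestArchive can implement this pass; the wrapper method below is a sketch:

```csharp
// Reads every entry and verifies CRCs without extracting to disk.
private static bool VerifyZipIntegrity(string archivePath, string? password)
{
    using var zipFile = new ICSharpCode.SharpZipLib.Zip.ZipFile(archivePath);
    if (password is not null)
        zipFile.Password = password;   // also exercises decryption
    return zipFile.TestArchive(true);  // true = test entry data, not just headers
}
```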
8.1.10 Archive Inspection
Archives can be inspected without extraction using the InspectAsync method on any compression provider. This returns:
```csharp
public record ArchiveContents(
    string Format,
    long TotalSizeCompressed,
    long TotalSizeUncompressed,
    double CompressionRatio,
    bool IsPasswordProtected,
    IReadOnlyList<ArchiveEntry> Entries);

public record ArchiveEntry(
    string Path,            // Relative path within archive
    long CompressedSize,
    long UncompressedSize,
    DateTime LastModified,
    bool IsDirectory);
```
This supports two use cases: validation steps within a job pipeline (e.g., confirm an expected file exists in the archive before proceeding), and the frontend UI where users can preview archive contents when configuring jobs.
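A sketch of the first use case; the expected entry name and surrounding step plumbing are hypothetical:

```csharp
// Confirm an expected file exists in the archive before the pipeline proceeds.
var contents = await provider.InspectAsync(archivePath, cancellationToken);
var expected = "invoices/summary.csv";   // hypothetical expected entry
if (!contents.Entries.Any(e => !e.IsDirectory && e.Path == expected))
    throw new InvalidOperationException(
        $"Archive {archivePath} is missing expected entry '{expected}'");
```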