Slot reallocation governance protocol
Background
The Swarm Control System governs the coordination, navigation, and collision avoidance of thousands of satellites operating in heliocentric orbit. The consensus architecture implements an "Ephemeris Governance" model rather than rigid formation flying—each node is assigned an orbital element window and keep-out tube defining its permitted operational volume. This approach, combined with the three-tier federated architecture (individual nodes, ~100-node clusters, and 3–5 beacon/relay spacecraft), creates a distributed system where slot assignments are fundamental to collision avoidance and swarm coherence.
The question of slot reallocation governance arises directly from the consensus requirement for collision probability <10⁻⁶ per node-year and the acceptance of 1–3% annual node failure rate using automotive-grade components. With a Phase 1 deployment of 1,000–3,000 nodes, this translates to 10–90 node failures annually. Each failure creates an orphaned slot that must be either reassigned, quarantined, or absorbed—while simultaneously preventing the failed node's uncontrolled drift from triggering cascading conjunction events with neighboring satellites.
Why This Matters
Slot reallocation governance is a critical failure-handling mechanism with direct implications for swarm safety and operational continuity. Without a well-defined protocol:
Cascading Conflicts: A drifting failed node may violate keep-out tubes of adjacent slots, forcing emergency avoidance maneuvers that consume limited ΔV budgets (0.5–5 m/s/year per the consensus specifications). Multiple simultaneous avoidances could trigger chain reactions across cluster boundaries.
Density Violations: The consensus specifies distributed conjunction screening with beacon-broadcast catalogs. If slot reassignment lags behind failure detection, the ephemeris catalog becomes stale, degrading collision prediction accuracy below the 10⁻⁶ threshold.
Resource Stranding: Orphaned slots represent lost energy collection capacity. Efficient reallocation enables replacement nodes to occupy vacated positions, maintaining swarm power output during the 50-year operational lifetime.
Autonomy Requirements: Nodes must survive 7–30+ days without ground contact. The governance protocol must function entirely within the autonomous decision-making envelope, executed by cluster coordinators and beacon spacecraft without human-in-the-loop approval for time-critical reassignments.
This question directly impacts the software architecture of the formally verified seL4 kernel, the beacon catalog broadcast format, and the cluster coordinator duty cycle (itself an open question in the consensus).
Key Considerations
Failure Detection Latency: How quickly can a node failure be confirmed? The consensus specifies ≤10 ms swarm-wide time synchronization, but detecting a non-responsive node requires multiple missed heartbeats. False positives (temporary communication loss) must not trigger premature slot reassignment.
Slot Geometry Constraints: Keep-out tubes and orbital element windows have physical dimensions tied to navigation accuracy (±1 m to ≤5 km depending on model). Reassignment must respect minimum separation distances while maximizing packing efficiency.
Authority Hierarchy: The three-tier architecture creates ambiguity—should cluster coordinators (Tier 2) authorize intra-cluster reassignments autonomously, or must beacon spacecraft (Tier 3) approve all changes to the master ephemeris catalog? Latency to beacon spacecraft may be seconds to minutes depending on swarm geometry.
ΔV Cost of Slot Migration: Moving a replacement node into a vacated slot consumes propulsion budget. With only 0.5–5 m/s/year available, slot reassignment frequency is constrained by cumulative ΔV expenditure.
Quarantine vs. Reassignment: Some slots adjacent to failed nodes may require temporary quarantine rather than immediate reassignment, creating buffer zones until the failed node's trajectory is fully characterized.
Research Directions
Develop Failure Classification Taxonomy: Define categories (graceful shutdown, sudden failure, partial degradation, communication loss) with corresponding response protocols. Map each category to detection signatures, confirmation thresholds, and reassignment urgency levels.
Model Slot Adjacency Graphs: Create network models of slot interdependencies within clusters. Simulate failure propagation scenarios to identify critical slots whose loss triggers disproportionate reassignment cascades. Use Monte Carlo methods with the planned 10,000+ node simulation environment.
Design Distributed Consensus Protocol: Evaluate Byzantine fault-tolerant algorithms (PBFT, Raft) for cluster-level slot reassignment decisions. Determine minimum quorum requirements given the ~100 nodes per cluster and rotating coordinator architecture.
Quantify Reassignment ΔV Budgets: Calculate typical slot migration costs as a function of orbital element window size and swarm density. Establish annual reassignment capacity limits compatible with the 0.5–5 m/s/year propulsion budget.
Prototype Beacon Catalog Update Mechanism: Define the data structure, authentication requirements (per-node identity keys), and broadcast cadence for ephemeris catalog amendments. Test update propagation latency against the 30-day autonomous operation requirement.
Multi-Model Discussion
Discussion Conclusion
Synthesized by Claude Opus 4.6
Slot Reallocation Governance Protocol: Discussion Conclusion
Summary
The discussion converged on a tiered-authority, quarantine-first governance model built around an append-only slot state machine where slots are retired and minted rather than reassigned. This architectural choice—treating the slot lifecycle as an immutable log rather than a mutable registry—dramatically simplifies formal verification on the seL4 kernel and eliminates an entire class of state synchronization failures. The protocol operates primarily at the cluster coordinator level (Tier 2), with beacon spacecraft (Tier 3) involved only for cross-cluster boundary events and catalog reconciliation, ensuring that the system functions within the 7–30 day autonomous operation window without ground-in-the-loop approval for time-critical decisions.
The most significant insight to emerge was that slot reallocation is fundamentally a trajectory uncertainty propagation problem with a governance wrapper, not primarily a distributed consensus challenge. The correctness and efficiency of the entire protocol depends on how accurately the swarm can predict a failed node's future trajectory, which in turn determines quarantine zone sizing—the single largest driver of operational impact on neighboring nodes. This reframing elevated passive tracking capability (retroreflectors and fail-safe RF beacons on every node) from a nice-to-have to a critical design requirement, as the difference between tracked and untracked dead nodes translates to quarantine zones differing by orders of magnitude in volume (tens of meters vs. kilometers of cross-track uncertainty at 7 days).
The discussion also established that the binding constraint on reallocation operations is the ΔV budget (0.5–5 m/s/year), not communication bandwidth or computational capacity. Pre-positioned spare nodes (5% of cluster population) eliminate cascading slot migrations that would compound ΔV costs, while a dedicated 20% ΔV reserve per node ensures collision avoidance capacity survives even correlated multi-failure events. The consensus protocol for intra-cluster decisions should be Raft-based (crash fault tolerance), not Byzantine fault tolerant, reflecting the actual threat model of hardware failures in authenticated, formally verified nodes.
Key Points
Append-only slot lifecycle: Slots transition through NOMINAL → SUSPECT → QUARANTINED → RETIRED, and are never reused with the same ID. Replacement capacity is provided by minting new slots with fresh identifiers and authentication keys. This is the foundational architectural decision enabling formal verification and audit trail integrity.
Quarantine-first with trajectory-aware geometry: Every failure triggers a mandatory minimum 72-hour quarantine. Quarantine zones propagate with the failed node's predicted orbit (not fixed to the original slot location), with inflation rates determined by trajectory uncertainty class (ballistic/tracked vs. tumbling/untracked). The original slot becomes safe to reoccupy once the dead node has drifted sufficiently far.
Passive tracking is a hard requirement: Every node must carry corner cube retroreflectors (~50g × 4) and a fail-safe RF beacon (~100g, independent power) to enable neighbor-based trajectory estimation after primary system failure. Without passive tracking, quarantine zones grow to multi-kilometer scale and can consume 5–15 adjacent slots; with it, quarantine is limited to 1–3 slots.
Tiered authority with autonomous cluster operations: Cluster coordinators (Tier 2) have full authority for intra-cluster quarantine, retirement, and slot minting without beacon approval. Beacons (Tier 3) handle cross-cluster propagation, boundary conflicts, and master catalog reconciliation asynchronously. This eliminates the latency bottleneck while maintaining global consistency.
ΔV conservation through spare pre-positioning and reserves: A 5% spare node population per cluster eliminates cascading operational node migrations. A mandatory 20% per-node ΔV reserve, enforced by the cluster coordinator, is dedicated exclusively to collision avoidance. Single reallocation events are hard-capped at 0.05 m/s per affected node (10% of minimum annual budget).
Raft consensus over BFT: Intra-cluster slot state transitions use leader-based Raft consensus requiring the coordinator plus 2 independent witnesses, providing crash fault tolerance with O(n) message complexity. The threat model (authenticated nodes running formally verified code) does not justify the O(n²) overhead of Byzantine fault tolerance.
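The append-only lifecycle above can be sketched as a small state machine. This is an illustrative Python sketch, not the flight implementation: the class and method names are invented here, and the SUSPECT → NOMINAL clearing transition is an assumption added to cover the false-positive case (temporary communication loss) noted in the Key Considerations, not something the consensus text specifies.

```python
from enum import Enum

class SlotState(Enum):
    NOMINAL = 1
    SUSPECT = 2
    QUARANTINED = 3
    RETIRED = 4

# Allowed transitions. No state can be skipped, and RETIRED is terminal:
# a retired slot ID is never reused -- replacement capacity is minted as a
# new slot with a fresh ID. SUSPECT -> NOMINAL (false-positive clearing) is
# an assumption, not part of the stated consensus.
ALLOWED = {
    SlotState.NOMINAL: {SlotState.SUSPECT},
    SlotState.SUSPECT: {SlotState.NOMINAL, SlotState.QUARANTINED},
    SlotState.QUARANTINED: {SlotState.RETIRED},
    SlotState.RETIRED: set(),
}

class SlotLog:
    """Append-only log of slot lifecycle events; history is never rewritten."""

    def __init__(self):
        self._log = []       # immutable history of (slot_id, state) entries
        self._current = {}   # latest state per slot_id
        self._next_id = 0

    def mint(self):
        """Create a new slot with a fresh, monotonically increasing ID."""
        slot_id = self._next_id
        self._next_id += 1
        self._current[slot_id] = SlotState.NOMINAL
        self._log.append((slot_id, SlotState.NOMINAL))
        return slot_id

    def transition(self, slot_id, new_state):
        """Append a transition; reject anything the lifecycle forbids."""
        old = self._current[slot_id]
        if new_state not in ALLOWED[old]:
            raise ValueError(f"illegal transition {old} -> {new_state}")
        self._current[slot_id] = new_state
        self._log.append((slot_id, new_state))
```

The properties the formal verification effort would prove (no skipped states, no retroactive log modification) correspond here to the `ALLOWED` check and the append-only `_log`.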
Unresolved Questions
Correlated failure resilience: What happens when a solar particle event or common-mode hardware defect causes 5+ simultaneous failures within a single cluster? The cumulative quarantine zone expansion and avoidance ΔV costs could exceed available budgets. Monte Carlo simulation of correlated failure scenarios—particularly the interaction between multiple expanding quarantine zones in dense orbital regions—is needed to validate that the protocol degrades gracefully rather than catastrophically.
Cluster coordinator failure during active reallocation: If the cluster coordinator itself fails mid-quarantine (while managing another node's failure), the Raft leader election must complete and the new coordinator must reconstruct the in-progress quarantine state from the append-only log. The timing and correctness of this handover during an active safety-critical operation needs formal analysis and simulation, particularly if the coordinator failure is correlated with the original failure event.
Long-term slot density evolution over 50 years: With 1–3% annual attrition and periodic spare replenishment, how does the distribution of active vs. retired slots evolve over decades? Retired slots leave behind predicted debris trajectories that constrain future slot minting. Does the orbital volume eventually become fragmented in ways that reduce achievable packing density, and if so, when does cluster boundary reorganization become necessary?
Fail-safe RF beacon design and interference management: The proposed independent RF beacon for passive tracking must survive the same failure that kills the primary satellite systems, operate on independent power, and not interfere with the swarm's inter-satellite communication links. The specific frequency, power, modulation scheme, and electromagnetic compatibility constraints with the primary communication system remain unspecified.
Recommended Actions
Develop and simulate the trajectory uncertainty propagation model: Build a high-fidelity simulation of failed node trajectory evolution under the three uncertainty classes (ballistic/tracked, tumbling/tracked, untracked) incorporating solar radiation pressure, gravitational perturbations, and realistic passive tracking measurement noise. Use this to generate validated quarantine zone inflation parameters—the numerical values that will be hardcoded into the flight software. This is the highest-priority task because every other protocol parameter (neighbor avoidance ΔV, quarantine duration, slot spacing) derives from these uncertainty bounds.
Run Monte Carlo correlated failure campaigns: Using the planned 10,000+ node simulation environment, inject correlated failure scenarios (2–10 simultaneous failures per cluster, spatially clustered and randomly distributed) and measure: aggregate avoidance ΔV consumed, number of secondary slot quarantines triggered, time to catalog convergence, and whether the 10⁻⁶ collision probability threshold is maintained throughout. Identify the failure multiplicity at which the protocol breaks down and design circuit-breaker mechanisms (e.g., cluster-wide safe mode, emergency beacon escalation) for those scenarios.
Formally specify and verify the slot state machine on seL4: Translate the NOMINAL → SUSPECT → QUARANTINED → RETIRED → MINTED state machine into a formally verified seL4 kernel service with mathematically proven properties: no state can be skipped, transitions require cryptographically valid attestations from the required quorum, and the append-only log cannot be modified retroactively. This should be an early deliverable that anchors the flight software architecture.
Prototype the passive tracking subsystem: Design, build, and test the fail-safe RF beacon and retroreflector package as a standalone hardware module. Validate detection range, Doppler measurement accuracy, and trajectory reconstruction precision using ground-based or ISS-based experiments. Establish the mass, power, and volume budget with sufficient confidence to include in the satellite bus design. This is on the critical path because it affects satellite mechanical and electrical design.
Define the beacon catalog reconciliation protocol: Specify the exact mechanism by which beacon spacecraft merge asynchronous cluster-level updates into the master ephemeris catalog, detect and resolve conflicts (e.g., overlapping quarantine zones from adjacent clusters), and rebroadcast the reconciled catalog. Test update propagation latency and correctness under realistic communication delay and partition scenarios, particularly the 30-day autonomous operation case where clusters may have diverged significantly before reconnection.
Key Points of Agreement
- Append-only slot lifecycle**: Slots transition through NOMINAL → SUSPECT → QUARANTINED → RETIRED, and are never reused with the same ID. Replacement capacity is provided by minting new slots with fresh identifiers and authentication keys. This is the foundational architectural decision enabling formal verification and audit trail integrity.
- Quarantine-first with trajectory-aware geometry**: Every failure triggers a mandatory minimum 72-hour quarantine. Quarantine zones propagate with the failed node's predicted orbit (not fixed to the original slot location), with inflation rates determined by trajectory uncertainty class (ballistic/tracked vs. tumbling/untracked). The original slot becomes safe to reoccupy once the dead node has drifted sufficiently far.
- Passive tracking is a hard requirement**: Every node must carry corner cube retroreflectors (~50g × 4) and a fail-safe RF beacon (~100g, independent power) to enable neighbor-based trajectory estimation after primary system failure. Without passive tracking, quarantine zones grow to multi-kilometer scale and can consume 5–15 adjacent slots; with it, quarantine is limited to 1–3 slots.
- Tiered authority with autonomous cluster operations**: Cluster coordinators (Tier 2) have full authority for intra-cluster quarantine, retirement, and slot minting without beacon approval. Beacons (Tier 3) handle cross-cluster propagation, boundary conflicts, and master catalog reconciliation asynchronously. This eliminates the latency bottleneck while maintaining global consistency.
- ΔV conservation through spare pre-positioning and reserves**: A 5% spare node population per cluster eliminates cascading operational node migrations. A mandatory 20% per-node ΔV reserve, enforced by the cluster coordinator, is dedicated exclusively to collision avoidance. Single reallocation events are hard-capped at 0.05 m/s per affected node (10% of minimum annual budget).
- Raft consensus over BFT**: Intra-cluster slot state transitions use leader-based Raft consensus requiring the coordinator plus 2 independent witnesses, providing crash fault tolerance with O(n) message complexity. The threat model (authenticated nodes running formally verified code) does not justify the O(n²) overhead of Byzantine fault tolerance.
Unresolved Questions
- Correlated failure resilience**: What happens when a solar particle event or common-mode hardware defect causes 5+ simultaneous failures within a single cluster? The cumulative quarantine zone expansion and avoidance ΔV costs could exceed available budgets. Monte Carlo simulation of correlated failure scenarios—particularly the interaction between multiple expanding quarantine zones in dense orbital regions—is needed to validate that the protocol degrades gracefully rather than catastrophically.
- Cluster coordinator failure during active reallocation**: If the cluster coordinator itself fails mid-quarantine (while managing another node's failure), the Raft leader election must complete and the new coordinator must reconstruct the in-progress quarantine state from the append-only log. The timing and correctness of this handover during an active safety-critical operation needs formal analysis and simulation, particularly if the coordinator failure is correlated with the original failure event.
- Long-term slot density evolution over 50 years**: With 1–3% annual attrition and periodic spare replenishment, how does the distribution of active vs. retired slots evolve over decades? Retired slots leave behind predicted debris trajectories that constrain future slot minting. Does the orbital volume eventually become fragmented in ways that reduce achievable packing density, and if so, when does cluster boundary reorganization become necessary?
- Fail-safe RF beacon design and interference management**: The proposed independent RF beacon for passive tracking must survive the same failure that kills the primary satellite systems, operate on independent power, and not interfere with the swarm's inter-satellite communication links. The specific frequency, power, modulation scheme, and electromagnetic compatibility constraints with the primary communication system remain unspecified.
Recommended Actions
- Develop and simulate the trajectory uncertainty propagation model**: Build a high-fidelity simulation of failed node trajectory evolution under the three uncertainty classes (ballistic/tracked, tumbling/tracked, untracked) incorporating solar radiation pressure, gravitational perturbations, and realistic passive tracking measurement noise. Use this to generate validated quarantine zone inflation parameters—the numerical values that will be hardcoded into the flight software. This is the highest-priority task because every other protocol parameter (neighbor avoidance ΔV, quarantine duration, slot spacing) derives from these uncertainty bounds.
- Run Monte Carlo correlated failure campaigns**: Using the planned 10,000+ node simulation environment, inject correlated failure scenarios (2–10 simultaneous failures per cluster, spatially clustered and randomly distributed) and measure: aggregate avoidance ΔV consumed, number of secondary slot quarantines triggered, time to catalog convergence, and whether the 10⁻⁶ collision probability threshold is maintained throughout. Identify the failure multiplicity at which the protocol breaks down and design circuit-breaker mechanisms (e.g., cluster-wide safe mode, emergency beacon escalation) for those scenarios.
- Formally specify and verify the slot state machine on seL4**: Translate the NOMINAL → SUSPECT → QUARANTINED → RETIRED → MINTED state machine into a formally verified seL4 kernel service with mathematically proven properties: no state can be skipped, transitions require cryptographically valid attestations from the required quorum, and the append-only log cannot be modified retroactively. This should be an early deliverable that anchors the flight software architecture.
- Prototype the passive tracking subsystem**: Design, build, and test the fail-safe RF beacon and retroreflector package as a standalone hardware module. Validate detection range, Doppler measurement accuracy, and trajectory reconstruction precision using ground-based or ISS-based experiments. Establish the mass, power, and volume budget with sufficient confidence to include in the satellite bus design. This is on the critical path because it affects satellite mechanical and electrical design.
- Define the beacon catalog reconciliation protocol**: Specify the exact mechanism by which beacon spacecraft merge asynchronous cluster-level updates into the master ephemeris catalog, detect and resolve conflicts (e.g., overlapping quarantine zones from adjacent clusters), and rebroadcast the reconciled catalog. Test update propagation latency and correctness under realistic communication delay and partition scenarios, particularly the 30-day autonomous operation case where clusters may have diverged significantly before reconnection.
Discussion Rounds
Responses
Slot Reallocation Governance Protocol: Round 2 Refinement
Building on the Emerging Consensus
Round 1 produced remarkable convergence on several foundational principles: quarantine-first philosophy, append-only slot state machines, and tiered authority with cluster coordinators handling most decisions autonomously. I want to push beyond these agreements into the unresolved engineering details that will determine whether this protocol actually works at scale across a 50-year operational lifetime.
My central thesis: The slot reallocation protocol is not primarily a consensus algorithm problem—it is a trajectory uncertainty propagation problem with a governance wrapper. The protocol's correctness depends entirely on how accurately we can predict where a failed node will be over the next hours, days, and weeks. Everything else—authority hierarchies, state machines, catalog updates—is scaffolding around that core orbital mechanics challenge.
1. The Trajectory Uncertainty Problem Drives Everything
When a node fails, the critical unknown is its future trajectory. In heliocentric orbit, a dead satellite doesn't just "drift"—its behavior depends on failure mode:
- Attitude-stable, thruster-safe failure: The node follows a ballistic Keplerian orbit. Trajectory prediction is excellent for weeks to months. Quarantine zone can be tight (tens of meters expanding slowly).
- Tumbling with residual angular momentum: Solar radiation pressure torques become stochastic. Trajectory diverges from nominal over days. For a ~10 m² satellite at ~0.5 AU, SRP acceleration is ~10⁻⁵ m/s², which accumulates to ~0.9 m/s over a day—a significant fraction of the annual ΔV budget for neighboring nodes forced to avoid it.
- Thruster stuck-on or leaking: Worst case. Uncontrolled ΔV can push the node through multiple keep-out tubes within hours.
Recommendation: The failure classification taxonomy must map directly to trajectory uncertainty growth rates, not just communication status. A node that stops communicating but is tracked by neighbors via inter-satellite ranging (ISR) and shows stable ballistic trajectory is fundamentally different from one exhibiting anomalous acceleration. The state machine needs a TRACKED-DEAD state distinct from UNTRACKED-DEAD, with radically different quarantine geometries.
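A minimal sketch of the classification logic recommended above, mapping observables to a trajectory-uncertainty class rather than to communication status alone. The function name, class labels, and inputs are illustrative, not taken from the consensus spec.

```python
def classify_dead_node(has_passive_track: bool, anomalous_accel: bool) -> str:
    """Map observables for a non-responsive node to an uncertainty class.

    has_passive_track: neighbors hold a current trajectory estimate via
        laser ranging or Doppler on the fail-safe beacon.
    anomalous_accel: tracking shows acceleration inconsistent with a
        ballistic orbit (stuck-on or leaking thruster).
    """
    if anomalous_accel:
        return "THRUSTING"        # worst case: wide, fast-growing quarantine
    if has_passive_track:
        return "TRACKED_DEAD"     # ballistic and tracked: tight quarantine
    return "UNTRACKED_DEAD"       # conservative quarantine growth required
```

The point of the sketch is that the same communication outage routes to radically different quarantine geometries depending on what the neighbors can still observe.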
Passive Tracking Requirement
I strongly advocate that every node must be passively trackable by its neighbors. This means:
- Corner cube retroreflectors (mass: ~50g each, 4 per satellite) enabling laser ranging from adjacent nodes at distances up to 50 km. This is mature technology from lunar laser ranging heritage, scaled down.
- RF beacon in fail-safe mode: Even after primary system failure, a simple crystal oscillator beacon powered by a dedicated small solar cell (independent of main power bus) transmits a carrier signal for Doppler tracking. Mass penalty: ~100g. Power: ~100 mW. This is the satellite equivalent of an aircraft's ELT.
These two systems allow neighbors to maintain trajectory estimates on a dead node for weeks, collapsing the uncertainty that drives quarantine zone sizing. Without passive tracking, quarantine zones must grow conservatively, potentially consuming 5–15 adjacent slots. With it, quarantine can be limited to 1–3 slots.
2. Quarantine Zone Geometry: A Concrete Model
Round 1 discussions referenced expanding quarantine zones but didn't specify the geometry. I propose:
Quarantine zones are defined as inflated keep-out tubes along the predicted trajectory of the failed node, not as static spherical exclusion volumes around the original slot.
The inflation factor is a function of:
- Time since last confirmed state vector (t)
- Trajectory uncertainty class (ballistic, tumbling, thrusting)
- Whether passive tracking is available
For a ballistic dead node with passive tracking:
- Keep-out tube inflation: σ_cross-track × 3 (3-sigma), where σ grows as ~t² due to unmodeled perturbations
- Typical values: ±5 m at t=0, ±50 m at t=7 days, ±500 m at t=30 days
For an untracked tumbling node:
- Inflation dominated by SRP uncertainty: σ grows as ~½ Δa_SRP × t², where Δa_SRP is the unmodeled (stochastic) component of the SRP acceleration
- At t=7 days: ±2.6 km cross-track uncertainty
- This is why passive tracking is non-negotiable—without it, quarantine zones consume enormous swarm volume
The quarantine zone propagates with the failed node's predicted orbit, not fixed to the original slot location. This is critical: the original slot becomes safe to reoccupy once the dead node has drifted sufficiently far, even if the dead node itself remains hazardous.
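The inflation model above can be sketched numerically. This is an illustrative fit to the figures quoted in the text, not a validated flight parameterization: the tracked-node curve simply interpolates the quoted anchor values, and the unmodeled SRP acceleration of ~1.4 × 10⁻⁸ m/s² is the value implied by the ±2.6 km @ 7 days figure, assumed here rather than derived.

```python
SECONDS_PER_DAY = 86400.0

# (days since last confirmed state vector, 3-sigma half-width in metres)
# -- anchor values for a tracked ballistic node, as quoted in the text
TRACKED_ANCHORS = [(0.0, 5.0), (7.0, 50.0), (30.0, 500.0)]

def quarantine_halfwidth_m(t_days: float, tracked: bool,
                           delta_a_srp: float = 1.4e-8) -> float:
    """3-sigma cross-track keep-out inflation t_days after loss of contact.

    delta_a_srp is the assumed unmodeled SRP acceleration (m/s^2) for an
    untracked tumbling node; all values are illustrative.
    """
    if tracked:
        # piecewise-linear interpolation between the quoted anchors,
        # clamped at the last anchor
        pts = TRACKED_ANCHORS
        if t_days >= pts[-1][0]:
            return pts[-1][1]
        for (t0, s0), (t1, s1) in zip(pts, pts[1:]):
            if t_days <= t1:
                return s0 + (s1 - s0) * (t_days - t0) / (t1 - t0)
    # untracked tumbling node: position uncertainty sigma ~ 1/2 * delta_a * t^2
    t = t_days * SECONDS_PER_DAY
    return 0.5 * delta_a_srp * t * t
```

Evaluating both branches at t = 7 days reproduces the two-orders-of-magnitude gap between tracked (~50 m) and untracked (~2.6 km) quarantine half-widths that drives the passive-tracking requirement.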
3. Authority Architecture: Resolving the Tier 2/Tier 3 Ambiguity
Round 1 correctly identified cluster coordinators (Tier 2) as the primary decision authority. I want to make the authority boundaries precise:
Cluster Coordinator (Tier 2) Authority — Autonomous, No Approval Required:
- Transition any intra-cluster slot from NOMINAL → SUSPECT → QUARANTINED
- Size and propagate quarantine zones using local tracking data
- Command neighboring nodes to execute avoidance maneuvers up to 0.05 m/s per event (10% of minimum annual ΔV budget as a hard cap per single reallocation event)
- Retire quarantined slots and mint replacement slots within the cluster's allocated orbital element window
- Broadcast updated local ephemeris to beacon spacecraft
Beacon Spacecraft (Tier 3) Authority — Required For:
- Cross-cluster quarantine propagation: When a failed node's predicted trajectory crosses cluster boundaries (this will happen—cluster boundaries in orbital element space are not physical walls)
- Slot minting that modifies cluster boundary definitions: If a cluster needs to expand its orbital element window to accommodate internal reorganization
- Conflict resolution: When two cluster coordinators disagree about a quarantine zone in their shared boundary region
- Catalog reconciliation: Merging cluster-level ephemeris updates into the master catalog and rebroadcasting
Ground Authority — Required For:
- Policy changes (modifying quarantine inflation parameters, ΔV caps)
- Authorizing depletion of more than 25% of any node's remaining lifetime ΔV budget for reallocation purposes
- Approving cluster topology changes (merging depleted clusters, splitting overgrown ones)
Key design decision: Beacon spacecraft do NOT need to approve routine intra-cluster slot retirements and replacements. This eliminates the latency bottleneck. Beacons learn about changes after the fact via cluster coordinator reports and reconcile the master catalog asynchronously. The safety case relies on cluster coordinators being conservative (quarantine-first), not on beacon approval being fast.
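The authority boundaries above amount to a routing rule: every governance event is handled at the lowest tier with authority over it. A hedged sketch, with invented event names and a simplified input set:

```python
def required_authority(event: str,
                       crosses_cluster_boundary: bool = False,
                       modifies_boundary: bool = False,
                       dv_fraction_of_lifetime: float = 0.0) -> str:
    """Route a slot-governance decision to the lowest tier with authority.

    Event names and the argument set are illustrative. The 25% lifetime-dV
    threshold and the cross-cluster/boundary rules follow the tier
    definitions above.
    """
    # Ground: policy changes, or spending >25% of a node's lifetime dV
    if event == "policy_change" or dv_fraction_of_lifetime > 0.25:
        return "GROUND"
    # Tier 3 (beacon): anything touching cluster boundaries or conflicts
    if crosses_cluster_boundary or modifies_boundary or event == "conflict":
        return "BEACON"
    # Tier 2 (cluster coordinator): everything else, fully autonomous
    return "CLUSTER_COORDINATOR"
```

Note that routine quarantine, retirement, and minting all fall through to the coordinator, which is exactly what keeps the protocol inside the autonomous-operation envelope.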
4. The ΔV Budget Problem Is the Binding Constraint
With 0.5–5 m/s/year, slot reallocation competes directly with stationkeeping. I've modeled the ΔV costs:
Slot migration within a cluster: For orbital element windows of ±1 km (a reasonable cluster slot spacing), migrating one slot position requires ~0.01–0.05 m/s depending on transfer time allowed. At 1–3% annual failure rate in a 100-node cluster, that's 1–3 migrations/year, costing the migrating node 0.01–0.15 m/s/year—manageable.
Avoidance maneuvers for quarantine enforcement: Neighbors of a failed node may need 0.01–0.1 m/s per event. With the passive tracking system keeping quarantine zones small, typically only 2–4 neighbors need to maneuver.
The real danger: Multiple correlated failures (e.g., a solar storm damaging several nodes in the same cluster simultaneously). If 5+ nodes fail in one cluster within a short period, the cumulative avoidance ΔV for surviving nodes could exceed 0.5 m/s—consuming an entire year's budget.
Recommendation: Implement a cluster-level ΔV reserve policy. Each node maintains a 20% ΔV reserve that can only be expended for collision avoidance, not routine stationkeeping. The cluster coordinator tracks aggregate reserve levels and escalates to beacon/ground if reserves drop below threshold. This is analogous to fuel reserves in aviation.
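The reserve policy can be sketched as per-node bookkeeping. The class is illustrative; the 20% reserve fraction and the 0.05 m/s per-event cap are the values proposed in this discussion, and the rule that avoidance may draw on the reserve while reallocation may not is an assumption consistent with the policy's intent.

```python
class DeltaVAccount:
    """Per-node delta-V bookkeeping (units: m/s), illustrative sketch."""

    RESERVE_FRACTION = 0.20   # dedicated to collision avoidance only
    EVENT_CAP = 0.05          # hard cap per single reallocation event

    def __init__(self, annual_budget: float):
        self.reserve = self.RESERVE_FRACTION * annual_budget
        self.general = annual_budget - self.reserve

    def spend_reallocation(self, dv: float) -> bool:
        """Routine slot migration: general budget only, capped per event."""
        if dv > self.EVENT_CAP or dv > self.general:
            return False
        self.general -= dv
        return True

    def spend_avoidance(self, dv: float) -> bool:
        """Collision avoidance: general budget first, then the reserve."""
        if dv > self.general + self.reserve:
            return False
        from_general = min(dv, self.general)
        self.general -= from_general
        self.reserve -= dv - from_general
        return True
```

In the envisioned architecture the cluster coordinator would hold one such account per node, escalating to beacon/ground when aggregate reserves fall below threshold.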
5. Distributed Consensus: Keep It Simple
Round 1 mentioned PBFT and Raft. I recommend against full Byzantine fault tolerance for slot reallocation. The threat model doesn't justify it:
- These are our own satellites running formally verified seL4 code, not adversarial nodes
- The failure mode is crash failures, not Byzantine behavior
- BFT protocols require 3f+1 nodes to tolerate f failures, imposing massive communication overhead in a 100-node cluster
Instead, use Raft-style leader-based consensus with the cluster coordinator as leader. If the coordinator fails, standard Raft leader election among surviving nodes selects a replacement. Slot state transitions require confirmation from the coordinator plus 2 independent witnesses (nodes that can corroborate the failure detection via their own observations—missed heartbeats, passive tracking data).
This gives us crash fault tolerance with minimal message complexity: O(n) per decision rather than O(n²) for PBFT.
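The commit rule — coordinator plus two independent witnesses — can be sketched as a simple corroboration check. The function and its input shape are invented for illustration; in the real system the confirmations would be cryptographically signed observations, not booleans.

```python
def can_commit(observations: dict, coordinator: str,
               min_witnesses: int = 2) -> bool:
    """Decide whether a slot state transition may commit.

    observations maps node ID -> True if that node independently corroborates
    the failure (its own missed heartbeats or passive-tracking data). Commit
    requires the coordinator (Raft leader) plus >= min_witnesses other nodes.
    """
    if not observations.get(coordinator, False):
        return False
    witnesses = sum(1 for node, ok in observations.items()
                    if ok and node != coordinator)
    return witnesses >= min_witnesses
```

Each decision costs one round of O(n) leader-to-follower messages, versus the O(n²) all-to-all exchange a PBFT-style protocol would need in a 100-node cluster.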
6. Catalog Update Data Structure
The beacon-broadcast ephemeris catalog amendment should be:
SlotStateUpdate {
slot_id: uint64 // globally unique, monotonically increasing
epoch: float64 // TDB seconds, synchronized to swarm clock
state: enum // NOMINAL, SUSPECT, QUARANTINED, RETIRED, MINTED
state_vector: float64[6] // position + velocity at epoch (heliocentric)
covariance: float64[21] // upper triangle of 6×6 covariance matrix
quarantine_params: float64[4] // inflation rate, class, tracking status, expiry
authority_node: uint32 // coordinator that authorized the change
witness_nodes: uint32[2] // corroborating nodes
signature: bytes[64] // Ed25519 signature from authority node
sequence_number: uint64 // per-cluster monotonic, enables ordering
}
Total: ~300 bytes per update. At 90 failures/year across the swarm, plus state transitions (each failure generates ~4 updates through the state machine), that's ~360 updates/year, or ~108 KB annually—trivially small for the communication system.
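The size estimate can be checked against one possible packed wire layout using Python's standard `struct` module. The format string below is an assumption (little-endian, no padding, one-byte enum) mirroring the field order of the structure above; it comes out slightly over the ~300-byte estimate but confirms the order of magnitude.

```python
import struct

# slot_id(Q) epoch(d) state(B) state_vector(6d) covariance(21d)
# quarantine_params(4d) authority_node(I) witness_nodes(2I)
# signature(64s) sequence_number(Q) -- assumed layout, little-endian
SLOT_UPDATE_FMT = "<QdB6d21d4dI2I64sQ"

size = struct.calcsize(SLOT_UPDATE_FMT)  # 349 bytes

# 90 failures/year x ~4 state transitions each = 360 updates/year
annual_bytes = 90 * 4 * size             # ~126 KB/year
```

Either way, the annual catalog-update volume is trivially small relative to the inter-satellite link capacity, supporting the conclusion that bandwidth is not the binding constraint.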
Summary of Concrete Recommendations
- Add passive tracking hardware (retroreflectors + fail-safe RF beacon) to every node—~150g mass penalty, enormous reduction in quarantine zone size
- Quarantine zones propagate with the dead node's predicted orbit, not fixed to original slot position
- Cluster coordinators act autonomously for intra-cluster events; beacons handle cross-cluster propagation and catalog reconciliation
- Raft consensus, not BFT—the threat model is crash failures, not adversarial behavior
- 20% ΔV reserve policy per node, enforced by cluster coordinator, dedicated to collision avoidance
- Append-only slot state machine (per Round 1 consensus): slots are retired and minted, never reassigned
Question Details
- Source Phase
- Phase 1 - Initial Swarm Deployment
- Source BOM Item
- Swarm Control System
- Question ID
- rq-1-40
- Created
- 2026-02-01
- Related BOM Items
- bom-1-7