Code Evolution in the Wild

The Engineering Behind Self-Improving Code

Phylogenetic tree showing code evolution from pursuit_v1 through 6 tactical phases
Gene Leybzon
May 2026

How Do We Create Software?

Three Paradigms

Path 1: Traditional

Hand-Coding

Human writes every line.
Slow, predictable, brittle.

πŸ‘₯ 5–10 engineers

Weeks β†’ Months
βœ“ Full control
βœ— Doesn't scale to complexity
Hand-coding paradigm
Path 2: AI-Assisted

"Vibe Coding"⚠

LLM generates code.
Fast, unreliable, hallucinates.

πŸ‘₯ 1 engineer

Hours β†’ Days
βœ“ Rapid prototyping
βœ— No performance guarantees
Vibe coding paradigm
Path 3: Evolutionary

Code Competes to Survive

Mutation, selection, iteration.
Emergent, adaptive, measurable.

πŸ‘₯ 1 evolution engineer

Hours β†’ Days
βœ“ Discovers novel strategies
βœ“ Human-readable output
Evolutionary paradigm

What Is Code Evolution?

Natural Selection β€” Applied to Algorithms

🧬 Biological Evolution

1. Variation — random DNA mutations
2. Selection — unfit die, fit reproduce
3. Inheritance — winning DNA passed on
4. Repeat — adaptations emerge

Fitness = survival & reproduction

πŸ’» Code Evolution

1. Variation — LLM proposes mutations
2. Selection — variants fight, losers die
3. Inheritance — winner = next parent
4. Repeat — strategies emerge

Fitness = wins − losses in combat

Same algorithm. Different substrate.
DNA β†’ Source code  Β·  Predator β†’ Opponent  Β·  Generations β†’ Rounds

Evolutionary Paradigms in Software

Biology rejected two of these. Code doesn't have to.

Darwinian

Random mutation + natural selection.

βœ… Biologically correct
βœ… What we used in M25
Example:

LLM proposes random code changes β†’ fitness test β†’ accept if better.

Lamarckian

Acquired traits are inherited (learning passes to offspring).

❌ Biologically wrong
βœ… Works in code!
Example:

LLM reflection loop: "I failed because X" → next mutation avoids X (sketched below).

Orthogenesis

Evolution has direction (mutations are guided, not random).

❌ Biologically wrong
βœ… Works in code!
Example:

Constrained mutation space: "only mutate formation logic, preserve message protocol."

Biology rejected Lamarck and Orthogenesis.
But in code, we can inherit learned behaviors. We can guide mutation.
Evolution is a tool, not a dogma.
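
To make the Lamarckian example concrete, here is a minimal sketch of a reflection journal feeding the next mutation prompt. The function name and prompt wording are assumptions for illustration, not SwarmEvolve's actual prompts:

#include <string>
#include <vector>

// Sketch: Lamarckian inheritance via a reflection journal (illustrative;
// names and wording are assumptions). Lessons learned by the parent
// generation are fed into the child's mutation prompt, so acquired
// knowledge (not just code) is inherited.
std::string build_mutation_prompt(const std::string& parent_code,
                                  const std::vector<std::string>& lessons)
{
    std::string prompt = "Improve this drone AI:\n" + parent_code + "\n\nPast lessons:\n";
    for (const std::string& lesson : lessons)
        prompt += "- " + lesson + "\n";   // e.g. "Clustering made us easy to hit"
    prompt += "\nPropose ONE mutation that avoids these failure modes.";
    return prompt;
}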

SwarmEvolve: The Test Arena

Autonomous Drone Swarms β€” Where Code Fights to Survive

πŸ”΄ Team A (final champion, 204 LOC)  vs  πŸ”΅ Team B (baseline, 66 LOC) β€” Round 1

What Is SwarmEvolve?
A research platform studying whether evolutionary pressure on LLM-generated code can produce effective multi-drone combat strategies β€” without human-written algorithms.

βš”οΈ Rules of the Game

πŸ—ΊοΈ
Arena β€” 1000Γ—1000 units, fixed boundary, no escape
🎯
Combat β€” Disable enemies within range (50 units), then cooldown penalty
πŸ‘οΈ
Vision β€” See all allies fully; see enemy positions but not their cooldowns
πŸ“‘
Comms β€” 4-float message broadcast per tick + 16 floats persistent memory
πŸ†
Win β€” Eliminate all enemies, or have more survivors at timeout (1000 ticks)
🧠
AI β€” Pure C++ functions (no neural nets), GPU-accelerated, deterministic

Key tension: Information asymmetry (hidden cooldowns) + coordination (message protocol) = emergent swarm tactics.
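
The actual API isn't shown here, but the rules above pin down its shape. A minimal sketch, assuming hypothetical names (AllyView, EnemyView, drone_decide); only the message/memory sizes and the visibility rules come from the list above:

// Hypothetical per-tick drone interface. The struct and function names
// are assumptions; the sizes (4-float message, 16-float memory) and
// visibility rules come from the rules listed above.
struct AllyView  { float x, y; bool alive; float message[4]; }; // allies fully visible, messages included
struct EnemyView { float x, y; bool alive; };                   // positions only: cooldowns are hidden

// Called once per drone per tick. A pure C++ function: no neural nets,
// no globals, deterministic given its inputs.
void drone_decide(int my_id, float my_x, float my_y,
                  const AllyView*  allies,  int num_allies,
                  const EnemyView* enemies, int num_enemies,
                  float memory[16],      // persists across ticks
                  float message_out[4],  // broadcast to allies each tick
                  float* move_x, float* move_y)
{
    *move_x = 0.0f;   // evolved strategy goes here
    *move_y = 0.0f;
}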

The Experiment

How We Set Up the Evolutionary Loop

SwarmEvolve system architecture
β‘  Generate β€” LLM writes C++ drone AI (Claude / Gemini)
β‘‘ Compile & Guard β€” Inject loop guards, compile in sandbox
β‘’ Simulate β€” GPU-accelerated 1000-tick combat
β‘£ Evaluate & Select β€” Fitness score β†’ winners survive β†’ loop

πŸ”΅ Blue Fleet vs πŸ”΄ Red Fleet β€” Simulated engagement

Fully automated: 95 rounds Β· ~90 minutes Β· ~$10 in API credits Β· Zero human intervention after launch.

Experiment Setup

M25: 95-round competitive co-evolution

HYPOTHESIS

When both teams evolve against each other, weaker teams can discover counter-tactics that surpass initially stronger opponents.

EVOLUTION PARAMETERS
  • Models: Sonnet 4 (planner) + Haiku 4.5 (coder)
  • Rounds: 95 total · alternating teams (A: even, B: odd)
  • Matches: 10 per fitness evaluation
  • Acceptance: Relative mode (champion − 0.05)
  • Reflection: Strict (enhanced journal validation)
  • Budget: ~$10 · ~90 minutes · 1 consumer GPU

The Matchup: David vs Goliath

πŸ”΅ Team A β€” The Champion

Source: M22 Generation 33 champion
Tactic: Claim-Arbitrated Targeting + Post-Shot Kite
Stats: 204 LOC Β· +1.0 fitness Β· 30/30 wins vs baseline
VS

πŸ”΄ Team B β€” The Underdog

Source: pursuit_v1 baseline
Tactic: Nearest-enemy pursuit Β· no coordination
Stats: 66 LOC Β· βˆ’0.8 fitness Β· losing 8 out of 10

Can a 66-line underdog evolve to beat a 204-line champion? Let's find out β†’

The Result: Underdog Wins

Evolution found what engineering didn't

Fitness over time: Team B starts at -0.8, crosses over Team A at Round 31, ends dominant

Team B (red) overtakes Team A (blue) at Round 31

πŸ† Final Score

βˆ’0.8
Team B start
β†’
+1.0
Team B finish

Team B went from losing 8 out of 10 battles to dominant winner β€” in 95 rounds of unguided evolution.

πŸ“ˆ Evolution Trajectory

R1–R30: Losing badly. Trying random approaches.
R31: Breakthrough. Formation Spread mutation accepted.
R32–R95: Dominance. Team A never recovers.

πŸ’‘ What Made the Difference

Formation Spread β€” a single parameter change:
min_spacing = 80
Drones stopped clustering, became harder to hit en masse. No human ever designed this tactic.

The recipe worked. Evolution found a way.

Emergent Behavior

Discovered, not programmed

Emergent behavior timeline showing Team B's tactical development across 5 phases: Bootstrap, Coordination, Prediction, Formation, and Zone Control

These tactics have names because we observed them; the LLM never planned "zone control." Across both teams: 55 mutation attempts, 8 accepted. Zone control emerged from one of those 8.

From Pursuit to Zone Control

Each phase added a capability β€” the breakthrough was their combination

R1 Β· βˆ’0.80 β†’ R3 R7 R9 β†’ R13 R19 β†’ β˜… R31 Β· +0.90 β†’ R41
PHASE 1 Β· BOOTSTRAP

"Learn to talk to each other"

Tactic: Message-Coordinated Targeting

What changed: Each drone broadcasts its intended target_id in message[2] (sketched below).

Why it helped: No more 5 drones piling onto 1 enemy while 4 others escape. Each drone claims a unique target.

βˆ’0.80 β†’ βˆ’0.20 (+0.60)
Bio analog: Bee waggle dance β€” direction sharing.
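
A minimal sketch of the claim mechanic, reusing the hypothetical interface from earlier. Using message[2] as the claim slot follows the description above; the lower-id-wins arbitration rule and the function name pick_target are assumptions:

#include <math.h>

// Sketch: claim-arbitrated targeting. Each drone claims one enemy by
// broadcasting its index in message[2]; claims by lower-id allies
// (read from last tick's messages) take priority. Assumes ally index
// equals drone id.
int pick_target(int my_id, float my_x, float my_y,
                const EnemyView* enemies, int num_enemies,
                const AllyView*  allies,  int num_allies,
                float message_out[4])
{
    int   best      = -1;
    float best_dist = 1e9f;

    for (int e = 0; e < num_enemies; e++) {
        if (!enemies[e].alive) continue;

        // Respect claims already made by lower-id allies.
        bool claimed = false;
        for (int a = 0; a < num_allies && a < my_id; a++) {
            if (allies[a].alive && (int)allies[a].message[2] == e) { claimed = true; break; }
        }
        if (claimed) continue;

        float dx = enemies[e].x - my_x;
        float dy = enemies[e].y - my_y;
        float d  = sqrtf(dx*dx + dy*dy);
        if (d < best_dist) { best_dist = d; best = e; }
    }

    message_out[2] = (float)best;   // broadcast our claim for the next tick
    return best;
}
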
PHASE 2 Β· REFINEMENT

"Aim where they'll be"

Tactic: Predictive Intercept Swarm

What changed: Each drone analyzes enemy positions and predicts their retreat vectors, then leads the shot (sketched below).

Why it helped: Shooting at "now" misses a moving target. Leading the target hits it.

βˆ’0.20 β†’ 0.00 (parity!)
Bio analog: Cheetah anticipating a gazelle's swerve.
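
A sketch of target leading under the same assumed interface: difference the target's position across ticks to estimate velocity, then aim ahead by the estimated time to close. The unit-speed constant and the choice of memory slots are assumptions:

#include <math.h>

// Sketch: lead a moving target (illustrative, not the evolved code).
// memory[0..1] persist the target's last observed position between ticks.
void lead_target(float my_x, float my_y, float tgt_x, float tgt_y,
                 float memory[16], float* aim_x, float* aim_y)
{
    // Estimate target velocity by differencing positions across ticks.
    // (The first tick gives a garbage estimate; it self-corrects on tick two.)
    float vx = tgt_x - memory[0];
    float vy = tgt_y - memory[1];
    memory[0] = tgt_x;
    memory[1] = tgt_y;

    // Lead by roughly the time needed to close the gap,
    // assuming a drone speed of 1 unit per tick.
    float dx = tgt_x - my_x, dy = tgt_y - my_y;
    float eta = sqrtf(dx*dx + dy*dy) / 1.0f;
    *aim_x = tgt_x + vx * eta;
    *aim_y = tgt_y + vy * eta;
}
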
⭐
PHASE 3 Β· BREAKTHROUGH

"Don't bunch up β€” own the space"

Tactic: Formation Spread β†’ Zone Control with Baiting

What changed: Drones maintain 80-unit minimum spacing; formation covers 60% of the arena.

Why it helped: No friendly fire. Multiple firing angles. No escape lanes.

0.00 β†’ +0.90 (the jump)
Bio analog: Wolf pack territorial spacing.

Communication alone couldn't win. Prediction couldn't dominate.
Evolution stacked them in order β€” and the combination was the breakthrough.

Code Archaeology

What Changed Between Round 1 and Round 31?

Round 1: Baseline (66 LOC)
// Simple pursuit (excerpt: closest_dist, target_id, move_x, move_y are declared earlier in the full file)
for (int i = 0; i < num_enemies; i++) {
    if (!enemies[i].alive) continue;

    float dx = enemies[i].x - my_x;
    float dy = enemies[i].y - my_y;
    float dist = sqrtf(dx*dx + dy*dy);

    if (dist < closest_dist) {
        closest_dist = dist;
        target_id = i;
        move_x = dx;
        move_y = dy;
    }
}

// Normalize and move
float mag = sqrtf(move_x*move_x + move_y*move_y);
if (mag > 0.01f) {
    move_x /= mag;
    move_y /= mag;
}
Round 31 Breakthrough: 187 LOC
// Formation Spread with repulsion (excerpt: pursuit_x, pursuit_y are computed by the pursuit logic above)
float repulse_x = 0.0f;
float repulse_y = 0.0f;

for (int i = 0; i < num_allies; i++) {
    if (i == my_id || !allies[i].alive) continue;

    float dx = my_x - allies[i].x;
    float dy = my_y - allies[i].y;
    float dist = sqrtf(dx*dx + dy*dy);

    const float min_spacing = 80.0f; // ← THE KEY LINE

    if (dist < min_spacing && dist > 0.01f) {
        float push_x = dx / dist;
        float push_y = dy / dist;
        float strength = (min_spacing - dist) / min_spacing;
        repulse_x += push_x * strength;
        repulse_y += push_y * strength;
    }
}

// Combine pursuit + repulsion
move_x = 0.6f * pursuit_x + 0.4f * repulse_x;
move_y = 0.6f * pursuit_y + 0.4f * repulse_y;

One constant. 80 units. Emergent zone coverage.
The swarm didn't know it was inventing a strategy. It just... worked.

Code complexity growth over rounds β€” 66 LOC baseline grows to 187 LOC at breakthrough
Learning speed comparison across evolution rounds
Lines-of-code vs fitness scatter β€” accepted mutations cluster at higher LOC + higher fitness

Evolution Engineering

A new discipline β€” designing the system that designs the code

Evolution doesn't just happen. Someone has to design the rules of the game: what counts as success, when a mutation gets accepted, how to keep the system from grinding to a halt. That person is an Evolution Engineer.

DECISION 1 · 🧬 Genome Scope — "One shared codebase, or one per individual?"

DECISION 2 · 🔄 Loop Topology — "Who evolves when — and how often?"

DECISION 3 · 📊 Fitness Function — "What exactly are we rewarding?"

DECISION 4 · ♛ Stagnation Defense — "How do we keep evolution from stalling?"

DECISION 5 · ⚙️ Mutation Bounds — "What's the LLM allowed to change?"

DECISION 6 · 🛑 Termination — "When are we done?"

Six dials. Turn them differently β€” get a different evolution.

Decision 1: The Genome Question

One shared codebase, or one per individual?

M25 β€” OUR PICK

Shared Genome

One C++ file represents the team. Every drone runs the same code. The whole "species" mutates as one unit.

Pros
  • Fast β€” one mutation, one compile, one test
  • Simple comparison: A's code vs B's code
  • Cheap to iterate ($10 for 95 rounds)
Cons
  • Tiny "population" β€” high variance
  • No genetic diversity to recombine
Bio analog: A clonal bacterial colony β€” every cell genetically identical.
vs.
ALTERNATIVE

Independent Genomes

Each individual has its own code. A whole population evolves with diversity, sub-species, even crossover.

Pros
  • Genetic diversity β†’ multiple strategies
  • Robust to bad luck (one weak variant)
Cons
  • Slow β€” many compilations per round
  • Hard to credit-assign across variants
  • Used by AlphaStar, NEAT β€” at $1M scale
Bio analog: A wild population β€” every individual is genetically different.

We picked shared. Speed of iteration mattered more than diversity for one experiment.

Decision 2: The Evolutionary Loops

Two clocks tick at different speeds

⚑

Inner Loop — every round

The mechanic of mutation. Runs hundreds of times per experiment.

propose β†’ compile β†’ simulate Γ—n β†’ score β†’ accept?
M25 settings: 1 mutation per round · n=10 matches · accept only if Δfitness > 0 (see the skeleton below)
πŸŒ€

Outer Loop — across rounds

The shape of competition. Decides who is the opponent β€” and when.

round k: A evolves β†’ round k+1: B evolves β†Ί repeat
M25 settings: Alternating teams · 95 rounds total · each candidate is always evaluated against the opponent's current champion

Inner loop asks: "Did this mutation help?"
Outer loop asks: "Who is the opponent now?"
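
Rendered in C++ to match the rest of the code shown here (the actual driver is scripts/evolve_coevolve.py, and the three stage functions below are illustrative stubs, not real APIs), one inner-loop round looks roughly like this:

#include <string>

// Stage stubs standing in for the real pipeline (illustrative only).
std::string llm_propose(const std::string& parent) { return parent; }          // stub: ask the LLM for a mutation
bool compiles_with_guards(const std::string& code) { return !code.empty(); }   // stub: sandbox compile + guard injection
float match_score(const std::string&, const std::string&) { return 0.0f; }     // stub: +1 win, -1 loss, 0 draw

// One inner-loop round: propose -> compile -> simulate x n -> score -> accept?
std::string evolve_round(const std::string& champion, const std::string& opponent,
                         int n_matches /* 10 in M25 */)
{
    std::string candidate = llm_propose(champion);
    if (!compiles_with_guards(candidate))
        return champion;                        // reject: failed to build

    float champ_fit = 0.0f, cand_fit = 0.0f;
    for (int m = 0; m < n_matches; m++) {
        champ_fit += match_score(champion,  opponent);
        cand_fit  += match_score(candidate, opponent);
    }

    // Simplest acceptance: keep the candidate only if fitness improved.
    // (M25's relative mode, with its 0.05 tolerance, is covered under Decision 3.)
    return (cand_fit > champ_fit) ? candidate : champion;
}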

Decision 3: The Fitness Function

What you reward IS what you get

Pick the wrong fitness function β€” and evolution will find loopholes you never imagined.

Raw Outcome

f = wins / matches

Simple. Honest.

⚠ Plateaus when both sides get good β€” no signal to climb past 50/50.

Shaped Reward

f = kills + objectives + survival_time

Richer signal early on.

⚠ Game-able β€” agent farms easy points instead of winning.

M25 β€” OUR PICK

Relative Fitness

f = score(me) βˆ’ score(opponent)

Adaptive β€” measures you against the current opponent, not a static yardstick.

βœ“ Prevents overfitting. Pairs naturally with co-evolution.

+ The Acceptance Criterion
A mutation isn't just "did it score better?" β€” it's "did the LLM also explain why?" M25 used strict reflection: accept only if the mutation comes with a coherent reasoning artifact. This filters out lucky noise from genuine insight.
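
As a sketch, here is relative fitness together with one plausible reading of the "champion − 0.05" acceptance rule. The names relative_fitness, accept_mutation, and reflection_ok are illustrative, not the harness's API:

// Sketch: relative fitness plus a plausible reading of M25's
// "relative mode (champion - 0.05)" rule. Scores are whatever the
// harness reports per match; reflection_ok stands in for the
// strict-reflection check (a coherent written rationale).
float relative_fitness(float my_score, float opponent_score)
{
    return my_score - opponent_score;   // f = score(me) - score(opponent)
}

bool accept_mutation(float candidate_fitness, float champion_fitness, bool reflection_ok)
{
    // Tolerate a small regression (0.05) to keep exploring,
    // but never accept without a coherent reasoning artifact.
    return reflection_ok && (candidate_fitness >= champion_fitness - 0.05f);
}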

Decision 4: Preventing Stasis

How do you keep evolution from stopping?

Once a system finds a "good enough" answer, mutations stop helping. Without intervention, fitness flatlines.
Nature has five tricks. Engineers borrow them.

β™›
Red Queen
Co-evolution. Opponent never stops moving the goalposts.
🌍
Environmental Shift
Change the arena, rules, or starting conditions.
🌊
Gene Flow
Inject fresh code from a different lineage.
🦫
Niche Construction
Let evolved tactics reshape the landscape itself.
β˜„οΈ
Adaptive Radiation
Wipe the slate; restart in a vacant niche.

We bet on Red Queen pressure β€” alternating opponents, forcing each to keep adapting.

If the experiment had stalled, we had four backup mechanisms ready.

Learning speed comparison: co-evolution 35% faster than isolated evolution

Decisions 5 & 6: The Guardrails

What's allowed to change? When do we stop watching?

βš™οΈ Mutation Bounds

Free-form β€” LLM rewrites anything

Maximum exploration. Risks compile failures, runtime crashes.

Constrained β€” only specific functions

Focused, safe. May miss novel solutions outside the sandbox.

AST-safe β€” auto-injected loop guards

A compile-time pass inserts memory-bounds checks and infinite-loop detectors.

M25 chose: Free-form + AST-safe

The LLM rewrites freely; we inject loop guards and memory bounds at compile time. No infinite loops. No leaks. No black-box crashes.
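
What injection might look like in practice. This before/after pair is illustrative; the real injector isn't shown here, and the 1M-iteration cap is an assumed value:

// Sketch of loop-guard injection (illustrative, not the real injector).
// Before: an LLM-written loop that can spin forever if `step` mutates to 0.
float approach_before(float dist, float step)
{
    while (dist > 50.0f) dist -= step;
    return dist;
}

// After: the compile-time pass wraps every loop with a bounded counter,
// so a pathological mutant fails fast instead of hanging the GPU batch.
float approach_after(float dist, float step)
{
    int guard = 0;
    while (dist > 50.0f) {
        if (++guard > 1000000) break;   // injected guard: cap iterations
        dist -= step;
    }
    return dist;
}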

πŸ›‘ Termination Criteria

Fixed budget β€” rounds, time, or $

Predictable cost. Easy to compare experiments.

Plateau detection β€” stop after K silent rounds

Saves money on dead runs; risks premature stop.

Goal achieved β€” reach target fitness

Clear success criterion; assumes you know the goal.

Open-ended β€” never stop

Let it run. Cheap if you have spare compute.

M25 chose: Fixed budget β€” 95 rounds, ~$10

Predictable, re-runnable for ablation studies, fits a lunch break.
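
For contrast, the two most mechanical stop rules reduce to one-line predicates. This sketch uses assumed names and an assumed K:

// Sketch: two stop rules side by side (M25 used the fixed budget).
bool fixed_budget_done(int round, int max_rounds /* 95 in M25 */)
{
    return round >= max_rounds;
}

bool plateau_done(int rounds_since_last_accepted_mutation, int K /* e.g. 10 */)
{
    return rounds_since_last_accepted_mutation >= K;   // K "silent" rounds
}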

Bounds = "what" can change. Termination = "when" we stop watching.

The Future of Evolutionary Code

If we're right, this is just the beginning

🌐 Co-Evolving Microservices

Frontend and backend evolve together, optimizing for latency, throughput, cost. APIs compete for efficiency.

πŸ›‘οΈ Immune-System Software

Code that adapts to attacks in real time. Firewall rules evolve against adversarial traffic patterns.

πŸ”§ Evolutionary Debugging

Mutate code until tests pass. Fitness = % tests green. Let evolution fix bugs while you sleep.

🀝 Symbiotic Codebases

Modules co-evolve for mutual benefit. Database queries optimize alongside indexing strategies.

What if all software were alive?

Credits & Reproduction

Standing on the Shoulders of Giants

Intellectual Foundations

  • Charles Darwin β€” Natural selection, Origin of Species
  • Gregor Mendel β€” Genetics, inheritance mechanisms
  • Stephen Jay Gould β€” Punctuated equilibrium (1972)
  • Leigh Van Valen β€” Red Queen hypothesis (1973)
  • Sewall Wright β€” Fitness landscapes (1932)
Portraits of Darwin, Mendel, Linnaeus, Gould, and Van Valen

Tools & Technologies

  • Claude Sonnet 4 β€” Mutation planner
  • Claude Haiku 4.5 β€” Code writer
  • OpenACC β€” GPU parallelization
  • C++17 β€” Implementation language

Reproduce This Experiment

git clone https://github.com/leybzon/SwarmEvolve
cd SwarmEvolve
python3 scripts/evolve_coevolve.py \
  --init-champion-a data/runs/m22_rq1_100gen/gen_0033/candidate.cpp \
  --init-champion-b src/baselines/pursuit_v1.cpp \
  --planner-model claude-sonnet-4-20250514 \
  --coder-model claude-haiku-4-5 \
  --rounds 100 --n-matches 10 --seed 42 \
  --acceptance-mode relative --strict-reflection

Questions? Open an issue on GitHub or contact Gene Leybzon.