Code Evolution in the Wild

The Engineering Behind Self-Improving Code

Phylogenetic tree showing code evolution from pursuit_v1 through 6 tactical phases
Gene Leybzon
May 2026

How Do We Create Software?

Three Paradigms

Path 1: Traditional

Hand-Coding

Human writes every line.
Slow, predictable, brittle.

πŸ‘₯ 5–10 engineers

Weeks β†’ Months
βœ“ Full control
βœ— Doesn't scale to complexity
Hand-coding paradigm
Path 2: AI-Assisted

"Vibe Coding"⚠

LLM generates code.
Fast, unreliable, hallucinates.

πŸ‘₯ 1 engineer

Hours β†’ Days
βœ“ Rapid prototyping
βœ— No performance guarantees
Vibe coding paradigm
Path 3: Evolutionary

Code Competes to Survive

Mutation, selection, iteration.
Emergent, adaptive, measurable.

πŸ‘₯ 1 evolution engineer

Hours β†’ Days
βœ“ Discovers novel strategies
βœ“ Human-readable output
Evolutionary paradigm

What Is Code Evolution?

Natural Selection β€” Applied to Algorithms

🧬 Biological Evolution

1. Variation — random DNA mutations
2. Selection — unfit die, fit reproduce
3. Inheritance — winning DNA passed on
4. Repeat — adaptations emerge

Fitness = survival & reproduction

πŸ’» Code Evolution

1. Variation — LLM proposes mutations
2. Selection — variants fight, losers die
3. Inheritance — winner = next parent
4. Repeat — strategies emerge

Fitness = wins − losses in combat

Same algorithm. Different substrate.
DNA β†’ Source code  Β·  Predator β†’ Opponent  Β·  Generations β†’ Rounds

Evolutionary Paradigms in Software

Biology rejected two of these. Code doesn't have to.

Darwinian

Random mutation + natural selection.

βœ… Biologically correct
βœ… What we used in M25
Example:

LLM proposes random code changes β†’ fitness test β†’ accept if better.

Lamarckian

Acquired traits are inherited (learning passes to offspring).

❌ Biologically wrong
βœ… Works in code!
Example:

LLM reflection loop: "I failed because X" → next mutation avoids X (sketched below).

Orthogenesis

Evolution has direction (mutations are guided, not random).

❌ Biologically wrong
βœ… Works in code!
Example:

Constrained mutation space: "only mutate formation logic, preserve message protocol."

Biology rejected Lamarck and Orthogenesis.
But in code, we can inherit learned behaviors. We can guide mutation.
Evolution is a tool, not a dogma.
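
To make the Lamarckian example concrete, here is a minimal sketch of a reflection journal feeding the next mutation prompt. The function name and prompt wording are assumptions for illustration, not SwarmEvolve's actual prompts:

#include <string>
#include <vector>

// Sketch: Lamarckian inheritance via a reflection journal (illustrative;
// names and wording are assumptions). Lessons learned by the parent
// generation are fed into the child's mutation prompt, so acquired
// knowledge (not just code) is inherited.
std::string build_mutation_prompt(const std::string& parent_code,
                                  const std::vector<std::string>& lessons)
{
    std::string prompt = "Improve this drone AI:\n" + parent_code + "\n\nPast lessons:\n";
    for (const std::string& lesson : lessons)
        prompt += "- " + lesson + "\n";   // e.g. "Clustering made us easy to hit"
    prompt += "\nPropose ONE mutation that avoids these failure modes.";
    return prompt;
}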

SwarmEvolve: The Test Arena

Autonomous Drone Swarms β€” Where Code Fights to Survive

πŸ”΄ Team A (final champion, 204 LOC)  vs  πŸ”΅ Team B (baseline, 66 LOC) β€” Round 1

What Is SwarmEvolve?
A research platform studying whether evolutionary pressure on LLM-generated code can produce effective multi-drone combat strategies β€” without human-written algorithms.

βš”οΈ Rules of the Game

πŸ—ΊοΈ
Arena β€” 1000Γ—1000 units, fixed boundary, no escape
🎯
Combat β€” Disable enemies within range (50 units), then cooldown penalty
πŸ‘οΈ
Vision β€” See all allies fully; see enemy positions but not their cooldowns
πŸ“‘
Comms β€” 4-float message broadcast per tick + 16 floats persistent memory
πŸ†
Win β€” Eliminate all enemies, or have more survivors at timeout (1000 ticks)
🧠
AI β€” Pure C++ functions (no neural nets), GPU-accelerated, deterministic

Key tension: Information asymmetry (hidden cooldowns) + coordination (message protocol) = emergent swarm tactics.
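
The actual API isn't shown here, but the rules above pin down its shape. A minimal sketch, assuming hypothetical names (AllyView, EnemyView, drone_decide); only the message/memory sizes and the visibility rules come from the list above:

// Hypothetical per-tick drone interface. The struct and function names
// are assumptions; the sizes (4-float message, 16-float memory) and
// visibility rules come from the rules listed above.
struct AllyView  { float x, y; bool alive; float message[4]; }; // allies fully visible, messages included
struct EnemyView { float x, y; bool alive; };                   // positions only: cooldowns are hidden

// Called once per drone per tick. A pure C++ function: no neural nets,
// no globals, deterministic given its inputs.
void drone_decide(int my_id, float my_x, float my_y,
                  const AllyView*  allies,  int num_allies,
                  const EnemyView* enemies, int num_enemies,
                  float memory[16],      // persists across ticks
                  float message_out[4],  // broadcast to allies each tick
                  float* move_x, float* move_y)
{
    *move_x = 0.0f;   // evolved strategy goes here
    *move_y = 0.0f;
}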

The Experiment

How We Set Up the Evolutionary Loop

SwarmEvolve system architecture
β‘  Generate β€” LLM writes C++ drone AI (Claude / Gemini)
β‘‘ Compile & Guard β€” Inject loop guards, compile in sandbox
β‘’ Simulate β€” GPU-accelerated 1000-tick combat
β‘£ Evaluate & Select β€” Fitness score β†’ winners survive β†’ loop

πŸ”΅ Blue Fleet vs πŸ”΄ Red Fleet β€” Simulated engagement

Fully automated: 95 rounds Β· ~90 minutes Β· ~$10 in API credits Β· Zero human intervention after launch.

Experiment Setup

M25: 95-round competitive co-evolution

HYPOTHESIS

When both teams evolve against each other, weaker teams can discover counter-tactics that surpass initially stronger opponents.

EVOLUTION PARAMETERS
  • Models: Sonnet 4 (planner) + Haiku 4.5 (coder)
  • Rounds: 95 total · alternating teams (A: even, B: odd)
  • Matches: 10 per fitness evaluation
  • Acceptance: Relative mode (champion − 0.05)
  • Reflection: Strict (enhanced journal validation)
  • Budget: ~$10 · ~90 minutes · 1 consumer GPU

The Matchup: David vs Goliath

πŸ”΅ Team A β€” The Champion

Source: M22 Generation 33 champion
Tactic: Claim-Arbitrated Targeting + Post-Shot Kite
Stats: 204 LOC Β· +1.0 fitness Β· 30/30 wins vs baseline
VS

πŸ”΄ Team B β€” The Underdog

Source: pursuit_v1 baseline
Tactic: Nearest-enemy pursuit Β· no coordination
Stats: 66 LOC Β· βˆ’0.8 fitness Β· losing 8 out of 10

Can a 66-line underdog evolve to beat a 204-line champion? Let's find out β†’

The Result: Underdog Wins

Evolution found what engineering didn't

Fitness over time: Team B starts at -0.8, crosses over Team A at Round 31, ends dominant

Team B (red) overtakes Team A (blue) at Round 31

πŸ† Final Score

βˆ’0.8
Team B start
β†’
+1.0
Team B finish

Team B went from losing 8 out of 10 battles to dominant winner β€” in 95 rounds of unguided evolution.

πŸ“ˆ Evolution Trajectory

R1–R30: Losing badly. Trying random approaches.
R31: Breakthrough. Formation Spread mutation accepted.
R32–R95: Dominance. Team A never recovers.

πŸ’‘ What Made the Difference

Formation Spread β€” a single parameter change:
min_spacing = 80
Drones stopped clustering, became harder to hit en masse. No human ever designed this tactic.

The recipe worked. Evolution found a way.

Emergent Behavior

Discovered, not programmed

Emergent behavior timeline showing Team B's tactical development across 5 phases: Bootstrap, Coordination, Prediction, Formation, and Zone Control

These tactics have names because we observed them; the LLM never planned "zone control." Across both teams: 55 mutation attempts, 8 accepted. Zone control emerged from one of those 8.

From Pursuit to Zone Control

Each phase added a capability β€” the breakthrough was their combination

R1 Β· βˆ’0.80 β†’ R3 R7 R9 β†’ R13 R19 β†’ β˜… R31 Β· +0.90 β†’ R41
PHASE 1 Β· BOOTSTRAP

"Learn to talk to each other"

Tactic: Message-Coordinated Targeting

What changed: Each drone broadcasts its intended target_id in message[2] (sketched below).

Why it helped: No more 5 drones piling onto 1 enemy while 4 others escape. Each drone claims a unique target.

βˆ’0.80 β†’ βˆ’0.20 (+0.60)
Bio analog: Bee waggle dance β€” direction sharing.
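
A minimal sketch of the claim mechanic, reusing the hypothetical interface from earlier. Using message[2] as the claim slot follows the description above; the lower-id-wins arbitration rule and the function name pick_target are assumptions:

#include <math.h>

// Sketch: claim-arbitrated targeting. Each drone claims one enemy by
// broadcasting its index in message[2]; claims by lower-id allies
// (read from last tick's messages) take priority. Assumes ally index
// equals drone id.
int pick_target(int my_id, float my_x, float my_y,
                const EnemyView* enemies, int num_enemies,
                const AllyView*  allies,  int num_allies,
                float message_out[4])
{
    int   best      = -1;
    float best_dist = 1e9f;

    for (int e = 0; e < num_enemies; e++) {
        if (!enemies[e].alive) continue;

        // Respect claims already made by lower-id allies.
        bool claimed = false;
        for (int a = 0; a < num_allies && a < my_id; a++) {
            if (allies[a].alive && (int)allies[a].message[2] == e) { claimed = true; break; }
        }
        if (claimed) continue;

        float dx = enemies[e].x - my_x;
        float dy = enemies[e].y - my_y;
        float d  = sqrtf(dx*dx + dy*dy);
        if (d < best_dist) { best_dist = d; best = e; }
    }

    message_out[2] = (float)best;   // broadcast our claim for the next tick
    return best;
}
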
PHASE 2 Β· REFINEMENT

"Aim where they'll be"

Tactic: Predictive Intercept Swarm

What changed: Each drone analyzes enemy positions and predicts their retreat vectors, then leads the shot (sketched below).

Why it helped: Shooting at "now" misses a moving target. Leading the target hits it.

βˆ’0.20 β†’ 0.00 (parity!)
Bio analog: Cheetah anticipating a gazelle's swerve.
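
A sketch of target leading under the same assumed interface: difference the target's position across ticks to estimate velocity, then aim ahead by the estimated time to close. The unit-speed constant and the choice of memory slots are assumptions:

#include <math.h>

// Sketch: lead a moving target (illustrative, not the evolved code).
// memory[0..1] persist the target's last observed position between ticks.
void lead_target(float my_x, float my_y, float tgt_x, float tgt_y,
                 float memory[16], float* aim_x, float* aim_y)
{
    // Estimate target velocity by differencing positions across ticks.
    // (The first tick gives a garbage estimate; it self-corrects on tick two.)
    float vx = tgt_x - memory[0];
    float vy = tgt_y - memory[1];
    memory[0] = tgt_x;
    memory[1] = tgt_y;

    // Lead by roughly the time needed to close the gap,
    // assuming a drone speed of 1 unit per tick.
    float dx = tgt_x - my_x, dy = tgt_y - my_y;
    float eta = sqrtf(dx*dx + dy*dy) / 1.0f;
    *aim_x = tgt_x + vx * eta;
    *aim_y = tgt_y + vy * eta;
}
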
⭐
PHASE 3 Β· BREAKTHROUGH

"Don't bunch up β€” own the space"

Tactic: Formation Spread β†’ Zone Control with Baiting

What changed: Drones maintain 80-unit minimum spacing; formation covers 60% of the arena.

Why it helped: No friendly fire. Multiple firing angles. No escape lanes.

0.00 β†’ +0.90 (the jump)
Bio analog: Wolf pack territorial spacing.

Communication alone couldn't win. Prediction couldn't dominate.
Evolution stacked them in order β€” and the combination was the breakthrough.

Code Archaeology

What Changed Between Round 1 and Round 31?

Round 1: Baseline (66 LOC)
// Simple pursuit (excerpt: closest_dist, target_id, move_x, move_y are declared earlier in the full file)
for (int i = 0; i < num_enemies; i++) {
    if (!enemies[i].alive) continue;

    float dx = enemies[i].x - my_x;
    float dy = enemies[i].y - my_y;
    float dist = sqrtf(dx*dx + dy*dy);

    if (dist < closest_dist) {
        closest_dist = dist;
        target_id = i;
        move_x = dx;
        move_y = dy;
    }
}

// Normalize and move
float mag = sqrtf(move_x*move_x + move_y*move_y);
if (mag > 0.01f) {
    move_x /= mag;
    move_y /= mag;
}
Round 31 Breakthrough: 187 LOC
// Formation Spread with repulsion (excerpt: pursuit_x, pursuit_y are computed by the pursuit logic above)
float repulse_x = 0.0f;
float repulse_y = 0.0f;

for (int i = 0; i < num_allies; i++) {
    if (i == my_id || !allies[i].alive) continue;

    float dx = my_x - allies[i].x;
    float dy = my_y - allies[i].y;
    float dist = sqrtf(dx*dx + dy*dy);

    const float min_spacing = 80.0f; // ← THE KEY LINE

    if (dist < min_spacing && dist > 0.01f) {
        float push_x = dx / dist;
        float push_y = dy / dist;
        float strength = (min_spacing - dist) / min_spacing;
        repulse_x += push_x * strength;
        repulse_y += push_y * strength;
    }
}

// Combine pursuit + repulsion
move_x = 0.6f * pursuit_x + 0.4f * repulse_x;
move_y = 0.6f * pursuit_y + 0.4f * repulse_y;

One constant. 80 units. Emergent zone coverage.
The swarm didn't know it was inventing a strategy. It just... worked.

Code complexity growth over rounds β€” 66 LOC baseline grows to 187 LOC at breakthrough
Learning speed comparison across evolution rounds
Lines-of-code vs fitness scatter β€” accepted mutations cluster at higher LOC + higher fitness

Evolution Engineering

A new discipline β€” designing the system that designs the code

Evolution doesn't just happen. Someone has to design the rules of the game: what counts as success, when a mutation gets accepted, how to keep the system from grinding to a halt. That person is an Evolution Engineer.

DECISION 1 · 🧬 Genome Scope — "One shared codebase, or one per individual?"

DECISION 2 · 🔄 Loop Topology — "Who evolves when — and how often?"

DECISION 3 · 📊 Fitness Function — "What exactly are we rewarding?"

DECISION 4 · ♛ Stagnation Defense — "How do we keep evolution from stalling?"

DECISION 5 · ⚙️ Mutation Bounds — "What's the LLM allowed to change?"

DECISION 6 · 🛑 Termination — "When are we done?"

Six dials. Turn them differently β€” get a different evolution.

Decision 1: The Genome Question

One shared codebase, or one per individual?

M25 β€” OUR PICK

Shared Genome

One C++ file represents the team. Every drone runs the same code. The whole "species" mutates as one unit.

Pros
  • Fast β€” one mutation, one compile, one test
  • Simple comparison: A's code vs B's code
  • Cheap to iterate ($10 for 95 rounds)
Cons
  • Tiny "population" β€” high variance
  • No genetic diversity to recombine
Bio analog: A clonal bacterial colony β€” every cell genetically identical.
vs.
ALTERNATIVE

Independent Genomes

Each individual has its own code. A whole population evolves with diversity, sub-species, even crossover.

Pros
  • Genetic diversity β†’ multiple strategies
  • Robust to bad luck (one weak variant)
Cons
  • Slow β€” many compilations per round
  • Hard to credit-assign across variants
  • Used by AlphaStar, NEAT β€” at $1M scale
Bio analog: A wild population β€” every individual is genetically different.

We picked shared. Speed of iteration mattered more than diversity for one experiment.

Decision 2: The Evolutionary Loops

Two clocks tick at different speeds

⚑

Inner Loop — every round

The mechanic of mutation. Runs hundreds of times per experiment.

propose β†’ compile β†’ simulate Γ—n β†’ score β†’ accept?
M25 settings: 1 mutation per round · n=10 matches · accept only if Δfitness > 0 (see the skeleton below)
πŸŒ€

Outer Loop — across rounds

The shape of competition. Decides who is the opponent β€” and when.

round k: A evolves β†’ round k+1: B evolves β†Ί repeat
M25 settings: Alternating teams · 95 rounds total · each candidate is always evaluated against the opponent's current champion

Inner loop asks: "Did this mutation help?"
Outer loop asks: "Who is the opponent now?"
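
Rendered in C++ to match the rest of the code shown here (the actual driver is scripts/evolve_coevolve.py, and the three stage functions below are illustrative stubs, not real APIs), one inner-loop round looks roughly like this:

#include <string>

// Stage stubs standing in for the real pipeline (illustrative only).
std::string llm_propose(const std::string& parent) { return parent; }          // stub: ask the LLM for a mutation
bool compiles_with_guards(const std::string& code) { return !code.empty(); }   // stub: sandbox compile + guard injection
float match_score(const std::string&, const std::string&) { return 0.0f; }     // stub: +1 win, -1 loss, 0 draw

// One inner-loop round: propose -> compile -> simulate x n -> score -> accept?
std::string evolve_round(const std::string& champion, const std::string& opponent,
                         int n_matches /* 10 in M25 */)
{
    std::string candidate = llm_propose(champion);
    if (!compiles_with_guards(candidate))
        return champion;                        // reject: failed to build

    float champ_fit = 0.0f, cand_fit = 0.0f;
    for (int m = 0; m < n_matches; m++) {
        champ_fit += match_score(champion,  opponent);
        cand_fit  += match_score(candidate, opponent);
    }

    // Simplest acceptance: keep the candidate only if fitness improved.
    // (M25's relative mode, with its 0.05 tolerance, is covered under Decision 3.)
    return (cand_fit > champ_fit) ? candidate : champion;
}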

Decision 3: The Fitness Function

What you reward IS what you get

Pick the wrong fitness function β€” and evolution will find loopholes you never imagined.

Raw Outcome

f = wins / matches

Simple. Honest.

⚠ Plateaus when both sides get good β€” no signal to climb past 50/50.

Shaped Reward

f = kills + objectives + survival_time

Richer signal early on.

⚠ Game-able β€” agent farms easy points instead of winning.

M25 β€” OUR PICK

Relative Fitness

f = score(me) βˆ’ score(opponent)

Adaptive β€” measures you against the current opponent, not a static yardstick.

βœ“ Prevents overfitting. Pairs naturally with co-evolution.

+ The Acceptance Criterion
A mutation isn't just "did it score better?" β€” it's "did the LLM also explain why?" M25 used strict reflection: accept only if the mutation comes with a coherent reasoning artifact. This filters out lucky noise from genuine insight.
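
As a sketch, here is relative fitness together with one plausible reading of the "champion − 0.05" acceptance rule. The names relative_fitness, accept_mutation, and reflection_ok are illustrative, not the harness's API:

// Sketch: relative fitness plus a plausible reading of M25's
// "relative mode (champion - 0.05)" rule. Scores are whatever the
// harness reports per match; reflection_ok stands in for the
// strict-reflection check (a coherent written rationale).
float relative_fitness(float my_score, float opponent_score)
{
    return my_score - opponent_score;   // f = score(me) - score(opponent)
}

bool accept_mutation(float candidate_fitness, float champion_fitness, bool reflection_ok)
{
    // Tolerate a small regression (0.05) to keep exploring,
    // but never accept without a coherent reasoning artifact.
    return reflection_ok && (candidate_fitness >= champion_fitness - 0.05f);
}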

Decision 4: Preventing Stasis

How do you keep evolution from stopping?

Once a system finds a "good enough" answer, mutations stop helping. Without intervention, fitness flatlines.
Nature has five tricks. Engineers borrow them.

β™›
Red Queen
Co-evolution. Opponent never stops moving the goalposts.
🌍
Environmental Shift
Change the arena, rules, or starting conditions.
🌊
Gene Flow
Inject fresh code from a different lineage.
🦫
Niche Construction
Let evolved tactics reshape the landscape itself.
β˜„οΈ
Adaptive Radiation
Wipe the slate; restart in a vacant niche.

We bet on Red Queen pressure β€” alternating opponents, forcing each to keep adapting.

If the experiment had stalled, we had four backup mechanisms ready.

Learning speed comparison: co-evolution 35% faster than isolated evolution

Decisions 5 & 6: The Guardrails

What's allowed to change? When do we stop watching?

βš™οΈ Mutation Bounds

Free-form β€” LLM rewrites anything

Maximum exploration. Risks compile failures, runtime crashes.

Constrained β€” only specific functions

Focused, safe. May miss novel solutions outside the sandbox.

AST-safe β€” auto-injected loop guards

A compile-time pass inserts memory-bounds checks and infinite-loop detectors.

M25 chose: Free-form + AST-safe

The LLM rewrites freely; we inject loop guards and memory bounds at compile time. No infinite loops. No leaks. No black-box crashes.
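
What injection might look like in practice. This before/after pair is illustrative; the real injector isn't shown here, and the 1M-iteration cap is an assumed value:

// Sketch of loop-guard injection (illustrative, not the real injector).
// Before: an LLM-written loop that can spin forever if `step` mutates to 0.
float approach_before(float dist, float step)
{
    while (dist > 50.0f) dist -= step;
    return dist;
}

// After: the compile-time pass wraps every loop with a bounded counter,
// so a pathological mutant fails fast instead of hanging the GPU batch.
float approach_after(float dist, float step)
{
    int guard = 0;
    while (dist > 50.0f) {
        if (++guard > 1000000) break;   // injected guard: cap iterations
        dist -= step;
    }
    return dist;
}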

πŸ›‘ Termination Criteria

Fixed budget β€” rounds, time, or $

Predictable cost. Easy to compare experiments.

Plateau detection β€” stop after K silent rounds

Saves money on dead runs; risks premature stop.

Goal achieved β€” reach target fitness

Clear success criterion; assumes you know the goal.

Open-ended β€” never stop

Let it run. Cheap if you have spare compute.

M25 chose: Fixed budget β€” 95 rounds, ~$10

Predictable, re-runnable for ablation studies, fits a lunch break.
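
For contrast, the two most mechanical stop rules reduce to one-line predicates. This sketch uses assumed names and an assumed K:

// Sketch: two stop rules side by side (M25 used the fixed budget).
bool fixed_budget_done(int round, int max_rounds /* 95 in M25 */)
{
    return round >= max_rounds;
}

bool plateau_done(int rounds_since_last_accepted_mutation, int K /* e.g. 10 */)
{
    return rounds_since_last_accepted_mutation >= K;   // K "silent" rounds
}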

Bounds = "what" can change. Termination = "when" we stop watching.

The Future of Evolutionary Code

If we're right, this is just the beginning

🌐 Co-Evolving Microservices

Frontend and backend evolve together, optimizing for latency, throughput, cost. APIs compete for efficiency.

πŸ›‘οΈ Immune-System Software

Code that adapts to attacks in real time. Firewall rules evolve against adversarial traffic patterns.

πŸ”§ Evolutionary Debugging

Mutate code until tests pass. Fitness = % tests green. Let evolution fix bugs while you sleep.

🀝 Symbiotic Codebases

Modules co-evolve for mutual benefit. Database queries optimize alongside indexing strategies.

What if all software were alive?

Credits & Reproduction

Standing on the Shoulders of Giants

Intellectual Foundations

  • Charles Darwin β€” Natural selection, Origin of Species
  • Gregor Mendel β€” Genetics, inheritance mechanisms
  • Stephen Jay Gould β€” Punctuated equilibrium (1972)
  • Leigh Van Valen β€” Red Queen hypothesis (1973)
  • Sewall Wright β€” Fitness landscapes (1932)
Portraits of Darwin, Mendel, Linnaeus, Gould, and Van Valen

Tools & Technologies

  • Claude Sonnet 4 β€” Mutation planner
  • Claude Haiku 4.5 β€” Code writer
  • OpenACC β€” GPU parallelization
  • C++17 β€” Implementation language

Reproduce This Experiment

git clone https://github.com/leybzon/SwarmEvolve
cd SwarmEvolve
python3 scripts/evolve_coevolve.py \
  --init-champion-a data/runs/m22_rq1_100gen/gen_0033/candidate.cpp \
  --init-champion-b src/baselines/pursuit_v1.cpp \
  --planner-model claude-sonnet-4-20250514 \
  --coder-model claude-haiku-4-5 \
  --rounds 100 --n-matches 10 --seed 42 \
  --acceptance-mode relative --strict-reflection

Questions? Open an issue on GitHub or contact Gene Leybzon.