HF Space - self-improving scientific agents

Self-Improving Open Scientific Agents

Designed and developed by Weinan Wang
A research demo for scientific AI agents that route tasks through an OpenClaw-style gateway, test Hermes-compatible reasoning lanes, critique their own outputs, and improve through benchmarks, memory, and reproducibility checks.
OpenClaw-style orchestration | Hermes-compatible backend lane | Self-critique + benchmark gate | Statistics + numerical analysis + decision agents
Improvement loop at a glance
Try a scientific request and watch the system draft, critique, revise, benchmark, stage memory, and export an audit trail for the next run.
Critique trail: Gateway, planner, specialist, critic, benchmark, memory, reproducibility, and report agents expose handoffs.
Learning gate: Critic findings, score trajectories, judge rows, and human-gated memory separate improvement from drift.
Python/R/MATLAB: Adapter sketches describe equivalent backend and payload routes.
Downloadable work: Memo, notebook, audit CSV, payload, trace, and artifact bundle preserve each improvement cycle.
Statistics critique: Diagnose a statistical workflow, score assumptions, and stage reusable lessons.
Solver benchmark: Route numerical convergence evidence through critic and reproducibility gates.
Decision audit: Review a control/OR policy, identify risk, and prepare the next run.
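The handoff trail and audit CSV described in the cards above can be illustrated with a minimal Python sketch. The stage names come from the agent list in the text. The `Handoff` record, the function names, and the CSV columns are hypothetical and are not taken from the demo's actual code.

```python
import csv
import io
from dataclasses import dataclass, asdict
from typing import List

# Stage order follows the agent list in the text; the rest is an assumption.
STAGES = ["gateway", "planner", "specialist", "critic",
          "benchmark", "memory", "reproducibility", "report"]


@dataclass
class Handoff:
    """One hypothetical audit row: which agent handed work to which."""
    step: int
    sender: str
    receiver: str
    note: str


def trace_handoffs(request: str) -> List[Handoff]:
    """Record one handoff per adjacent pair of stages for a single request."""
    return [
        Handoff(step=i + 1, sender=a, receiver=b,
                note=f"{a} passed '{request}' to {b}")
        for i, (a, b) in enumerate(zip(STAGES, STAGES[1:]))
    ]


def to_audit_csv(handoffs: List[Handoff]) -> str:
    """Serialize the trace as CSV text, one row per handoff."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["step", "sender", "receiver", "note"])
    writer.writeheader()
    for h in handoffs:
        writer.writerow(asdict(h))
    return buf.getvalue()
```

Calling `to_audit_csv(trace_handoffs("check residuals"))` yields a header row plus one row per handoff, which is the kind of file a reviewer could diff between runs.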
Self-Improving Agents -> Statistics -> Numerical Analysis -> Optimal Control / OR

Control Room

Agent team
Reasoning backend lane
Orchestration layer
Memory mode
Benchmark gate
Safety mode

When off, benchmark-approved lessons are staged but not promoted into next-run memory.
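The staging behavior described above can be sketched as follows. This is a minimal illustration that assumes a single boolean toggle. The class name `MemoryStore`, the field `promote_enabled`, and the return labels are hypothetical, and the source does not say which control the toggle belongs to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStore:
    """Hypothetical two-tier memory: staged lessons and next-run lessons."""
    promote_enabled: bool = True  # assumed name for the on/off toggle
    staged: List[str] = field(default_factory=list)
    next_run: List[str] = field(default_factory=list)

    def record(self, lesson: str, benchmark_approved: bool) -> str:
        """Stage approved lessons; promote them only when the toggle is on."""
        if not benchmark_approved:
            return "rejected"          # never staged, never promoted
        self.staged.append(lesson)     # approved lessons are always staged
        if self.promote_enabled:
            self.next_run.append(lesson)
            return "promoted"
        return "staged"                # toggle off: held back from next run
```

With `MemoryStore(promote_enabled=False)`, an approved lesson lands in `staged` while `next_run` stays empty, so a reviewer can still inspect it before it affects a later run.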

Research mode

Quick starts

Scope note: this deterministic demo shows workflow, critique, benchmarking, and product direction without claiming unsupervised external actions.
A lab for agents that learn from their own runs
Draft, critique, benchmark, remember.
This demo is for scientific AI systems that should not answer once and disappear. It routes a request through specialist agents, attacks the first draft, scores the result against reproducibility and risk gates, stages human-approved lessons, and prepares a stronger next run.
Self-critique with receipts: The critic records what failed, why it matters, and what should change in the next attempt.
Benchmark gate: Quality, risk, reproducibility, and reviewer signals decide whether a lesson is worth keeping.
Open-model storyline: Shows OpenClaw-style orchestration and Hermes-compatible reasoning lanes with clear scope boundaries.
Shared memory: Approved lessons can improve the Statistics, Numerical Analysis, and Optimal Control/OR agents.
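One way to picture the critic record and the benchmark gate is the Python sketch below. The four signal names come from the text. The thresholds, the rule that every signal must pass, and all class and function names are assumptions made for illustration, since the source does not specify how the signals are combined.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical floors for the signals where higher is better.
THRESHOLDS: Dict[str, float] = {
    "quality": 0.7, "reproducibility": 0.8, "reviewer": 0.6}
MAX_RISK = 0.3  # hypothetical ceiling; risk is treated as lower-is-better


@dataclass
class Finding:
    """What the critic records about a draft."""
    what_failed: str
    why_it_matters: str
    next_change: str


def passes_gate(scores: Dict[str, float]) -> bool:
    """Keep a lesson only if every signal clears its floor and risk is low."""
    floors_ok = all(scores[name] >= floor
                    for name, floor in THRESHOLDS.items())
    return floors_ok and scores["risk"] <= MAX_RISK


def improvement_cycle(finding: Finding, scores: Dict[str, float],
                      staged: List[str]) -> bool:
    """Stage the critic's proposed change when the benchmark gate passes."""
    if passes_gate(scores):
        staged.append(finding.next_change)
        return True
    return False
```

Requiring every signal to pass, rather than averaging them, keeps a high quality score from masking a risky or irreproducible result. A weighted score would be an equally plausible design.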