Architecture

Three layers, one primitive: an engine that audits, an on-chain registry that attests, and a set of surfaces any agent can call.

The stack

  • Engine (packages/engine) — Slither + Aderyn with 16 custom detectors, a 98-pattern exploit corpus, the cross-validating LLM cascade, consensus scoring, the Arsia gas profiler, and AES-256-GCM encryption.
  • Agent infra (packages/contracts) — AnnealValidation (verdict registry), AnnealAgent (ERC-8004 facade), AnnealStaking (auditor accountability).
  • Surfaces — CLI (with a --threshold exit-code gate), safety-oracle API (packages/web), MCP server (packages/mcp), Telegram bot (packages/telegram), and a GitHub Action that blocks merges in CI.

One audit, end to end

Static analysis and the critic cascade run in parallel. A ChainGPT pre-screen feeds two architecturally distinct critics, Groq Llama-3.3-70B and OpenAI GPT-OSS-120B (both served on Groq's LPU), that cross-validate each other; Gemini 2.5 Pro is an optional third critic, off by default. A ChainGPT pre-screen failure is non-fatal: the critics still run. Alongside them run Slither + Aderyn with our 16 custom detectors and the 98-pattern exploit corpus. The cascade never false-cleans: if nothing could analyze a contract (say, a single .sol file with unresolved imports that won't compile, and no model response), the verdict is flagged analysisIncomplete and reported as “could not complete the audit,” never as safe or 100/100. Single-contract audits run the full critic cascade by default (thorough), not a quick pre-screen-only pass.
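The never-false-clean rule can be pictured as a small guard that runs before any score is reported. This is a minimal sketch under stated assumptions, not the engine's code: the Verdict class, the build_verdict function, and its parameters are hypothetical names chosen for illustration; only the analysisIncomplete flag and the “could not complete the audit” wording come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    analysis_incomplete: bool   # mirrors the analysisIncomplete flag
    score: Optional[int]        # None when no score can be justified
    summary: str


def build_verdict(static_ok: bool, model_responses: int,
                  score: Optional[int]) -> Verdict:
    """Hypothetical guard: refuse to report a score when nothing analyzed the code.

    static_ok       -- whether static analysis produced output (assumed input)
    model_responses -- how many models returned a usable response (assumed input)
    score           -- the score computed upstream, if any
    """
    if not static_ok and model_responses == 0:
        # Nothing ran successfully, so any score would be a false clean.
        return Verdict(True, None, "could not complete the audit")
    return Verdict(False, score, "audit completed")
```

With this shape, a contract that fails to compile and gets no model response can never surface as 100/100, because the upstream score is discarded rather than passed through.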

Cross-validation is the moat. A reported finding has to survive corroboration: it needs ≥2 independent sources (two models, or a model plus Slither) or it is dropped. A single-model hunch never reaches the verdict. When the same issue is flagged by several engines, the consensus scorer dedups it (by line overlap) into one finding that lists every source, e.g. “Reentrancy — flagged by chaingpt, groq, gpt-oss, slither”. The scorer then boosts confidence on the corroborated finding, floors single-model findings, and culls anything below 20% confidence.
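A minimal sketch of the corroboration rule, assuming a simplified data model. The Finding class and the consensus and overlaps functions are hypothetical names; matching on title in addition to line overlap is an assumption; and taking the maximum confidence stands in for the real boosting formula, which the text does not specify. The two constants come from the text (≥2 sources, 20% cull).

```python
from dataclasses import dataclass, field
from typing import List, Set

MIN_SOURCES = 2         # from the text: at least two independent sources
MIN_CONFIDENCE = 0.20   # from the text: cull anything below 20% confidence


@dataclass
class Finding:
    title: str
    start: int           # first affected line
    end: int             # last affected line (inclusive)
    confidence: float
    sources: Set[str] = field(default_factory=set)


def overlaps(a: Finding, b: Finding) -> bool:
    # Assumption: two reports describe the same issue when the titles match
    # and their line ranges intersect.
    return a.title == b.title and a.start <= b.end and b.start <= a.end


def consensus(raw: List[Finding]) -> List[Finding]:
    merged: List[Finding] = []
    for f in sorted(raw, key=lambda x: (x.title, x.start)):
        for m in merged:
            if overlaps(m, f):
                # Dedup: widen the range, union the sources, keep the
                # higher confidence (placeholder for the real boost).
                m.start = min(m.start, f.start)
                m.end = max(m.end, f.end)
                m.sources |= f.sources
                m.confidence = max(m.confidence, f.confidence)
                break
        else:
            merged.append(Finding(f.title, f.start, f.end,
                                  f.confidence, set(f.sources)))
    # Keep only corroborated findings that clear the confidence cull.
    return [m for m in merged
            if len(m.sources) >= MIN_SOURCES and m.confidence >= MIN_CONFIDENCE]
```

In this sketch a reentrancy report from groq on lines 40-52 and one from slither on lines 45-50 collapse into a single finding listing both sources, while an issue reported by only one model is dropped.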

Multilingual reports. The audit runs in English, then Tencent Hunyuan (its Hunyuan-MT model on Tencent Cloud TokenHub) translates the finished verdict and findings into the reader's language: zh, es, ja, ko, fr, pt, de, ru, it, ar, hi, vi, th, tr. On @tryannealbot: /audit <url|address> <lang> (e.g. /audit 0x… zh); on the web /try page, language chips under each result translate it in one click. Translation is credited to Tencent Hunyuan.

Deterministic, reproducible audits. AI audits have a reputation for being non-deterministic: ask twice, get two answers. TryAnneal's verdict is reproducible: the same contract always returns the same result, run to run. Every model decodes at temperature 0 (greedy, seeded), so each pass is identical; the corroboration rule above keeps a single-model hunch from ever drifting the verdict; scoring is confidence-weighted; and both the Telegram bot and the hosted MCP memoize by code hash (keccak/sha3 of the source), so identical source returns the identical audit.
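Memoizing by code hash might look roughly like the sketch below. The names code_hash, audit_memoized, and the in-memory _cache are hypothetical, and the real services presumably use a persistent store. Note one assumption in particular: Python's hashlib.sha3_256 is NIST SHA3-256, which produces different digests from Ethereum's Keccak-256; the text does not say which variant is used, so this stands in for either.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical in-memory cache keyed by the hash of the contract source.
_cache: Dict[str, dict] = {}


def code_hash(source: str) -> str:
    # NIST SHA3-256 from the standard library (an assumption; see lead-in).
    return hashlib.sha3_256(source.encode("utf-8")).hexdigest()


def audit_memoized(source: str, run_audit: Callable[[str], dict]) -> dict:
    """Return the cached audit for identical source, running it only once."""
    key = code_hash(source)
    if key not in _cache:
        _cache[key] = run_audit(source)
    return _cache[key]
```

Because the key depends only on the source bytes, two requests for byte-identical code return the same stored audit object, while any edit to the source (even whitespace) produces a new key and a fresh run.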

Trust model

  • No single LLM is trusted — Slither/Aderyn cross-validate every flagged line, and ≥2-model agreement raises confidence.
  • Findings are encrypted; only the verdict score + severity counts are public on-chain. Destroying the key crypto-shreds the report.
  • Verdicts carry the posting agentId — consumers weight by on-chain reputation, never blind trust.
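On the consumer side, weighting verdicts by the posting agent's reputation could be as simple as a weighted average. This is an illustrative sketch only: the weighted_score function, the numeric reputation map, and the choice to give unknown agents zero weight are all assumptions, not part of the registry's interface.

```python
from typing import Dict, List, Optional, Tuple


def weighted_score(verdicts: List[Tuple[int, float]],
                   reputation: Dict[int, float]) -> Optional[float]:
    """Combine (agent_id, score) verdicts, weighting each by agent reputation.

    Returns None when no verdict comes from an agent with positive
    reputation, so an unknown poster is never trusted blindly.
    """
    total = 0.0
    total_weight = 0.0
    for agent_id, score in verdicts:
        weight = reputation.get(agent_id, 0.0)  # unknown agents count for nothing
        total += weight * score
        total_weight += weight
    return total / total_weight if total_weight > 0 else None
```

For example, a score of 90 from an agent with reputation 3 and a score of 30 from an agent with reputation 1 combine to 75, and a lone verdict from an agent with no recorded reputation yields no score at all.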