Methodology

A complete, transparent breakdown of how the experiment works — what data the models receive, how they process it, and how trades are scored.

01 — Overview

Experiment Design — A Rolling, Multi-Season Format

The AI Trading Benchmark runs as a series of monthly 4-week seasons. Each season pits 2–4 frontier AI models head-to-head under identical conditions — same data, same prompts, same agents, same broker. The only thing that changes between contestants is the model API call. The lineup rotates each season as new models ship or older ones are retired.

Each model manages a fresh $50,000 demo account hosted by Pepperstone Markets with strict 2% risk per trade. Every trade, every decision, every “no-trade” call is logged and published.

Season Roster

Season 1 · Apr 13 – May 12, 2026 · Completed

Claude Opus 4.6 (Anthropic) vs GPT-5.4 (OpenAI). Final standings: Claude +4.53% ($52,267), GPT +8.90% ($54,449), combined +6.72%. Scope refined mid-season to 4 instruments — NAS100, US30, SPX500, EUR/USD. XAUUSD and USDJPY were removed after early data showed neither model held an edge there.

Season 2 · May 13 – Jun 5, 2026 · Live

Claude Opus 4.6 (continuing — same master automations) vs Claude Opus 4.7 (new — replaces GPT-5.4 on the custom automations). Same 4-instrument scope, fresh $50K accounts.

Why this format: model releases happen every few weeks. A rolling monthly schedule keeps the benchmark current without freezing a stale matchup. Past seasons stay live on the site as Final Standings — see the leaderboard on the home page.

02 — Execution Environment

Live Trading on Pepperstone Markets

This is not a backtest. Both models trade in real-time on live market conditions using two separate $50,000 demo accounts hosted by Pepperstone Markets, a globally regulated broker operating under ASIC, FCA, and CySEC oversight. Demo accounts mirror live market conditions — real spreads, real price feeds, real execution timing.

Broker Details

Broker: Pepperstone Markets
Account Type: Demo — Standard Spread, No Commission
Platform: cTrader / MT5
Regulation: ASIC, FCA, CySEC, DFSA, SCB

Account Configuration

Starting Balance: $50,000 per model
Risk Per Trade: 2% of account equity
Execution: SkyAnalyst Proprietary Trading Bridge → cTrader / MT5
Verification: All fills independently auditable via broker statements

Third-Party Verification — Myfxbook

Both accounts are publicly tracked on Myfxbook for additional transparency. Equity curves, trade history, drawdown, and profitability metrics are independently verified and available in real-time under the following profiles:

GPT vs Claude Phase 1 2026 APR 26Myfxbook Verified

GPT vs Claude Phase 2 2026 APR 26Myfxbook Verified

GPT vs Claude Phase 3 2026 APR 26Myfxbook Verified

A Note on Index Pricing

The EUR/USD pair has standardized pricing across brokers — the entry and exit prices you see in our articles will closely match what you see on your own platform. US index CFDs (NAS100, US30, US500) are different. Each broker constructs its own index price feed, which means entry prices, stop distances, and P&L figures for index trades are specific to Pepperstone Markets. If you trade US indexes on a different broker, your prices will vary. All trades in this experiment were analyzed, executed, and settled on Pepperstone demo accounts using Pepperstone's price feed.

Trades are executed automatically through the SkyAnalyst Proprietary Trading Bridge, which connects the AI analysis layer directly to cTrader and MT5. When a model issues an entry signal, the bridge translates it into a real order with proper lot sizing (calculated from the 2% risk rule and the structural stop distance), stop loss, and three take-profit levels — executed in milliseconds with no human in the loop.

Why demo accounts instead of live capital? Transparency and reproducibility. Demo accounts on Pepperstone mirror live market conditions while eliminating the variable of capital risk, which would introduce regulatory and ethical complications for a public experiment. The results are verifiable, the conditions are real, and the execution is identical to what a live account would produce.

03 — Data Pipeline

What the Models Receive

Before each trading session, SkyAnalyst AI assembles a structured data packet of approximately ~100,000 tokens per instrument. This is not a simple price feed — it is a professional-grade analysis environment equivalent to what a Chartered Market Technician would review before making a trading decision. The data packet contains four layers:

Layer 1 — Multi-Timeframe Candle Data

5 hours of price action across three timeframes — 60-minute, 15-minute, and 5-minute candles. Each candle includes open, high, low, close, and volume, plus a full indicator overlay:

EMA (Fast/Slow)ATR (Volatility)MACD + HistogramRSIVolume + SMAVWAP + Std Dev Bands

Layer 2 — Session Structure & Fibonacci

Key reference levels from each trading session — Tokyo, London, and New York highs and lows — plus Fibonacci retracement and extension levels computed from the dominant swing. These provide the structural framework the models use to identify entry zones, stop loss placement, and profit targets.

Layer 3 — Macro Context Window (5-Day)

A rolling 5-day snapshot of cross-asset market conditions, delivered as structured JSON with daily values, EMAs, and range positions:

10Y Treasury YieldDXY (Dollar Index)VIX (Volatility)NYAD (Breadth)Oil (WTI)

Layer 4 — AI Agent Pre-Analysis

Before the models even begin their analysis, two specialized AI agents within SkyAnalyst AI have already processed the data:

Macro Analysis Agent

Synthesizes the macro environment, economic calendar releases, and intermarket correlations into a directional bias with a confidence score and tradeability rating. Outputs both an intraday and multi-day horizon assessment.

Trend Authority Agent

Evaluates the technical structure — EMA alignment, momentum, regime classification (trending, ranging, volatile) — and provides direction, confidence, key support/resistance levels, and an invalidation price. Also recommends position sizing adjustments based on volatility.

The complete data packet also includes the day's economic calendar with impact ratings, any pre-market news summaries, and the previous session's analysis for continuity. Prompts may vary per instrument to account for asset-specific dynamics (equity index vs. forex vs. commodities).

04 — Analysis Framework

Classical CMT Methodology

The analysis framework is rooted in classical Chartered Market Technician (CMT) methodology. We deliberately do not use alternative technical frameworks such as ICT concepts, Fair Value Gaps (FVG), order blocks, or other non-classical approaches. The indicator suite is a curated subset of proven CMT tools — EMA trend structure, RSI momentum, MACD confirmation, ATR-based volatility scaling, VWAP anchoring, and Fibonacci levels.

Each model receives a persistent system prompt that defines its reasoning framework — the “playbook” that stays constant across every session. This is the actual instruction set:

System Prompt — Decision Framework

1. Risk regime:Read the Macro Agent's bias, confidence, and tradeability. Check the cross-asset environment (DXY, 10Y, VIX, NYAD). If tradeability is low, raise the bar or stand aside. Classify the environment.

2. Agent synthesis:Read the Trend Agent's direction, confidence, regime, key levels, and invalidation. When both agents agree with solid confidence, strongest foundation. When they conflict, note why and reduce conviction. Trending regime favors continuation; ranging favors mean-reversion toward VWAP.

3. Session context: Assess gap vs prior close relative to ATR. Read the session handoff from 60min candles — where is price relative to session high/low and VWAP? Identify the 1–2 dominant drivers today.

4. Multi-timeframe read: 60min for bias (EMA 9/21/50, RSI, MACD). 15min for structure. 5min for entry precision — VWAP tests, EMA pullbacks, session levels, Fibonacci zones. Entry zones must be at 5m/15m structural levels.

5. Calendar gate: No entries within 15 minutes of high-impact events. If data already released, assess the reaction and whether it has settled.

6. Build or pass: Only propose setups where macro environment, agent signals, and technical structure all support the direction. If any domain actively contradicts, state the conflict and reduce confidence or pass. Stop placement is structural, scaled to current volatility: on compressed days (VIX declining, narrow ranges), stops tighter near structure; on expanding days (VIX rising, wide ranges), stops wider — but the setup must still meet minimum 1.5:1 R:R after the wider stop. If volatility makes R:R unworkable at structural stop levels, No Trade. If the structural stop exceeds the Trend Agent invalidation level, skip the setup. Add a small buffer beyond the stop for execution slippage — setups are forwarded directly to an automated trading system. TP1 should target 1R–1.25R at a structural level. If no structure exists in that zone, evaluate the full target profile — a close TP1 with a strong TP2 at 2R+ is a valid trade. Reject only when the trade is structurally inverted: the highest-probability exit delivers less than 1R and reaching further targets requires breaking through major levels.

If conviction is low, No Trade is the correct output.

This framework is identical for both models. It defines how they reason — the session data (next section) defines what they reason about.

05 — Session Data

What Changes Every Day

While the system prompt (above) stays constant, the session data changes with every trading day. Below is the structure of the data packet assembled by SkyAnalyst AI and injected alongside the system prompt. Prompts may vary per instrument. The actual data values, candle arrays, and agent outputs are omitted — in production, this packet is approximately ~100,000 tokens.

// System: You are an expert CMT trading analyst...

=== MACRO ANALYSIS AGENT (EURUSD / forex) ===
Group Bias: bull (confidence: 28%) | Data age: 12min
EURUSD Bias: strong_bear (score: -75) | Confidence: 22%
Horizon: intraday=bear, short-term=strong_bear
Tradeability: high (72/100)
[BEARISH] Fed-ECB rate divergence: 160+ bps...
[BEARISH] Eurozone energy shock: 3-4x baseline...
=== END MACRO ANALYSIS AGENT ===

=== TREND AUTHORITY AGENT ===
Direction: BULLISH | Confidence: 64% (Moderate)
Regime: Trending | Strength: Moderate
Key Resistance: 4593.3 | Key Support: 4553.28
VWAP: 4554.9 | Invalidation: 4546.92
Recommendation: Reduce size (VIX elevated)
=== END TREND AUTHORITY AGENT ===

### Economic Calendar
10:00am USD - JOLTS Job Openings [HIGH IMPACT]
(Forecast: 6.89M, Previous: 7.24M)

### Market Indicators (5-Day JSON)
{ vix, dxy, oil, gold, 10y_yield, nyad... }

### 60min Candles (5h) + EMA, ATR, MACD, RSI, Vol, VWAP
### 15min Candles (5h) + full indicator suite
### 5min Candles (5h) + full indicator suite

### Session Levels
Tokyo High/Low, London High/Low, NY High/Low
Fibonacci retracement/extension levels

// Instruction: Follow the 6-step framework.
// Produce 0-2 setups with SL and 3 TP levels.
// If conviction is low, No Trade is correct.

Note: This is a simplified representation. The full prompt includes complete candle arrays with all indicator values, detailed agent reasoning, and instrument-specific context.

06 — Model Configuration

API Settings & Fairness Controls

Both models are called via their respective official APIs with settings designed for maximum reasoning depth. Neither model has a temperature advantage — GPT 5.x does not support temperature settings, and Claude Opus 4.6 uses its default. Both receive identical input data and identical system prompts.

Setting	GPT-5.4	Claude Opus 4.6
Model ID	gpt-5.4-2026-03-05	claude-opus-4-6
Max Tokens	25,000 tokens	25,000 tokens
Temperature	Not supported (GPT 5.x)	Not explicitly set
Reasoning Effort	High	Default (High)
Timeout	120s (2 min)	120s (2 min)
API Endpoint	/v1/chat/completions	/v1/messages

Both models receive streaming responses. The difference in max token limits reflects each provider's API defaults — in practice, session analyses typically use 4,000–8,000 completion tokens.

07 — Trade Output

What the Models Produce

After processing the ~100K token data packet, each model produces a comprehensive session analysis that typically identifies 1–2 trade setups. Each setup includes:

Direction & Thesis

Long or short, with a written rationale explaining the confluence of signals

Entry Zone

A price range at a structural level, not a single price point

Stop Loss

Structural stop scaled to current volatility via ATR, with slippage buffer

3 Take-Profit Levels

TP1 at ~1R (structural), TP2 at ~2R, TP3 at extended target. Minimum 1.5:1 R:R required

Confidence Gate

6-factor confluence scoring: macro alignment, yield direction, DXY, trend agent, key level, EMA stack

Risk Assessment

Specific risks to the trade with mitigation strategies (news events, resistance clusters, VIX-driven sizing)

Once a setup triggers, the model monitors price in real-time and issues an entry signal with a confidence percentage. The entry is then executed automatically via the SkyAnalyst Proprietary Trading Bridge to cTrader / MT5 on Pepperstone Markets — no human touches the trade at any point.

08 — Infrastructure

Why This Cannot Be Replicated in ChatGPT or Claude Alone

You cannot reproduce this experiment by pasting a prompt into ChatGPT or Claude. Here's why:

No access to real-time market data

ChatGPT and Claude do not have access to live price feeds, broker candle data, or real-time economic calendar releases. The ~100K token data packet is assembled by SkyAnalyst AI from live broker APIs, structured and formatted specifically for LLM consumption.

No AI agent pre-processing layer

The Macro Analysis Agent and Trend Authority Agent are proprietary SkyAnalyst AI systems that run independently before the trading model sees the data. These agents provide the bias, confidence, regime classification, and tradeability scores that form the foundation of every analysis. Without them, the model is working blind.

No broker bridge for live execution

Analysis without execution is academic. The SkyAnalyst Proprietary Trading Bridge connects directly to cTrader and MT5 on Pepperstone Markets, translating the model's trade signals into real orders with proper lot sizing, stop loss, and take profit levels — executed in milliseconds. This is the difference between a research paper and a live trading system.

No real-time monitoring and entry timing

After the model identifies a setup, SkyAnalyst AI continuously monitors price action in real-time, waiting for the exact entry trigger conditions to be met. The model evaluates each new candle against its setup criteria and issues an entry signal only when conditions align. This monitoring loop cannot happen in a chat interface.

The point of this infrastructure is not complexity for its own sake. It exists to emulate what a professional trader actually does: read the macro environment, analyze multi-timeframe technicals, identify a setup with defined risk, wait for precise entry conditions, and execute with discipline. SkyAnalyst AI gives the trading model everything it needs to do this — live data, preprocessed context, real-time monitoring, and broker execution — under the same conditions and constraints a professional desk would demand. A backtest can always be curve-fit. A live demo account operating under real market conditions, with real spreads, real slippage, and real execution timing, cannot.

09 — Trading Rules

Constraints & Risk Management

Trading Window

8:00–11:00 AM EST daily

Starting Balance

$50,000 per model

Risk Per Trade

2% of account equity

Minimum R:R

1.5:1 or no trade

News Exclusion

No entries within 15 min of high-impact events

Execution

SkyAnalyst Proprietary Trading Bridge → cTrader / MT5 on Pepperstone

Profit-Taking Policy: Full Close at TP1

While each setup defines three take-profit levels (TP1, TP2, TP3), the full position is closed at TP1 — the first target — for benchmark scoring purposes. This keeps results measurable and directly comparable across models. The AI identifies TP2 and TP3 as part of its analysis framework (demonstrating its ability to read extended targets), but actual P&L is realized at TP1.

Additional risk controls:

Stop placement is structural, scaled to current volatility via ATR
If VIX is elevated, position size is reduced per Trend Agent recommendation
If structural stop exceeds Trend Agent invalidation level, setup is skipped
If volatility makes R:R unworkable at structural stop levels, No Trade
A “no-trade” decision is a valid, scored action

10 — Scoring

How Trades Are Scored

Performance is evaluated across five dimensions:

Total P&L

Absolute dollar and percentage return on the $50K account

Win Rate

Percentage of trades closed in profit

Max Drawdown

Largest peak-to-trough decline during the competition

Risk-Adjusted Return

Sharpe-like ratio measuring return per unit of risk taken

Consistency

Performance stability across trading days — penalizes boom/bust patterns

11 — Disclosure

Transparency & Disclosures

This experiment is conducted by The AI Trading Benchmark, with trading infrastructure provided by SkyAnalyst AI (a product of SkyWeaver Trading LLC) and trade execution hosted by Pepperstone Markets. This is an independent research initiative. Results are published transparently regardless of outcome.

Claude is a trademark of Anthropic. GPT is a trademark of OpenAI. This experiment is independent and not endorsed by either company.

Trading involves risk. This experiment uses demo accounts and is conducted for educational and research purposes. Past performance — including results from this experiment — does not guarantee future results. Nothing published on this site constitutes financial advice.

View Phases Back to Home