How We Test

hvilkenAI.no tests AI models on practical, everyday tasks — not academic benchmarks.

Independent and Unbiased

hvilkenAI has no affiliate agreements, sponsors, or commercial partnerships with the AI providers we test. We receive no commission, discounts, or benefits from any model provider. All recommendations are based solely on test results. We are funded by subscription revenue from Pro users and advertising — never by the providers we evaluate.

Our Philosophy

We test what people actually use AI for: writing emails, summarising text, answering questions, and following instructions in Norwegian, Swedish, and Danish. A model that scores well with us should work well for the same everyday tasks.

What We Measure

Norwegian Language Quality (0–5)

How well does the model understand and write Norwegian Bokmål? Did it respond in Norwegian, or did it fall back to English?

Instruction Following (0–5)

Does the model do what you actually ask for? Correct length, format, and content matter.

Speed (tokens/second)

How quickly do you get a response? We measure tokens per second and time to first token (TTFT).
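
As a rough illustration, both speed figures can be derived from three timestamps recorded while a response streams in. This is a minimal sketch, not hvilkenAI's actual measurement code: the `StreamTiming` class, its field names, and the choice to measure throughput from the first token onward are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Timestamps (in seconds) recorded while streaming one response.

    Hypothetical structure for illustration only.
    """
    request_sent: float      # when the request left the client
    first_token: float       # when the first token arrived
    last_token: float        # when the final token arrived
    completion_tokens: int   # number of tokens in the response


def time_to_first_token(t: StreamTiming) -> float:
    """Seconds the user waits before any output appears (TTFT)."""
    return t.first_token - t.request_sent


def tokens_per_second(t: StreamTiming) -> float:
    """Generation throughput, measured from the first token onward (assumption)."""
    generation_time = t.last_token - t.first_token
    if generation_time <= 0:
        return 0.0
    return t.completion_tokens / generation_time


# Illustrative numbers, not a real measurement.
timing = StreamTiming(request_sent=0.0, first_token=0.5, last_token=2.5, completion_tokens=120)
print(time_to_first_token(timing))  # 0.5
print(tokens_per_second(timing))    # 60.0
```

Separating the two numbers matters in practice: a model can start answering quickly but generate slowly, or the other way round.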

Price (kr per million tokens)

What does it cost in Norwegian kroner? Prices are updated daily based on exchange rates.
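
The conversion itself is simple arithmetic. The sketch below assumes the provider price is quoted in US dollars per million tokens and that a USD-to-NOK rate is fetched once a day. The function name, parameter names, and numbers are illustrative and not taken from hvilkenAI's code.

```python
def price_in_nok(usd_per_million_tokens: float, usd_to_nok: float) -> float:
    """Convert a USD price per million tokens to Norwegian kroner.

    Assumes the source price is in USD; rounds to two decimals (ore).
    """
    return round(usd_per_million_tokens * usd_to_nok, 2)


# Illustrative numbers only, not a real price or exchange rate.
print(price_in_nok(usd_per_million_tokens=2.0, usd_to_nok=10.5))  # 21.0
```

Because the exchange rate moves, the kroner price of a model can change from day to day even when the provider's own price does not.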

Overall Score (0–10)

A weighted total that combines Norwegian language quality, instruction following, speed, and value per krone.

Orchestrator Score (0–10) — unique to hvilkenAI.no

How well suited is the model to controlling other AI models in Norwegian? The score is calculated as Norwegian × instruction following. Multiplying rather than averaging means that a weakness in either dimension pulls the whole score down: a model that doesn’t write Norwegian cannot orchestrate effectively in Norwegian.
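
To see why multiplication is stricter than averaging, here is a minimal sketch. The page states only that the score is Norwegian × instruction on a 0–10 scale. Dividing the 0–25 product by 2.5 to reach that scale is an assumption made for this example, as is the function name.

```python
def orchestrator_score(norwegian: float, instruction: float) -> float:
    """Combine two 0-5 sub-scores into a 0-10 score by multiplication."""
    for value in (norwegian, instruction):
        if not 0 <= value <= 5:
            raise ValueError("sub-scores must be between 0 and 5")
    # The product ranges over 0-25; dividing by 2.5 maps it onto 0-10.
    # This scaling is an assumption, not a published formula.
    return round(norwegian * instruction / 2.5, 1)


print(orchestrator_score(5, 5))  # 10.0
print(orchestrator_score(4, 4))  # 6.4
print(orchestrator_score(5, 0))  # 0.0
```

A simple average would give the last model a mid-range score for its perfect instruction following. Under multiplication, a zero in Norwegian zeroes the whole result.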

Model Selection and Test Frequency

Every morning, we evaluate over 350 available models via the OpenRouter API. We automatically select the 12 best-performing models, distributed across three price tiers: premium, mid-range, and budget. The selection is not hardcoded — new models are automatically tested when they appear, and models that fail are replaced by the next candidate from the same price tier.
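
The selection logic described above might look roughly like the following. This is a sketch under stated assumptions: the `Candidate` class, the `failed` flag, and an even split of four models per tier are hypothetical, since the page says only that 12 models are distributed across three tiers.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One model in the daily pool (hypothetical structure)."""
    name: str
    tier: str             # "premium", "mid-range", or "budget"
    score: float          # overall score, 0-10
    failed: bool = False  # True if the model errored or could not be tested


def select_models(candidates: list[Candidate], per_tier: int = 4) -> dict[str, list[str]]:
    """Pick the top `per_tier` working models in each price tier.

    Failed models are skipped, so the next candidate in the same
    tier takes their place automatically.
    """
    selection: dict[str, list[str]] = {}
    for tier in ("premium", "mid-range", "budget"):
        working = [c for c in candidates if c.tier == tier and not c.failed]
        working.sort(key=lambda c: c.score, reverse=True)
        selection[tier] = [c.name for c in working[:per_tier]]
    return selection


# Placeholder model names, not real benchmark entries.
pool = [
    Candidate("model-a", "premium", 8.1),
    Candidate("model-b", "premium", 9.5, failed=True),
    Candidate("model-c", "premium", 7.4),
    Candidate("model-d", "budget", 9.0),
]
print(select_models(pool, per_tier=2))
# {'premium': ['model-a', 'model-c'], 'mid-range': [], 'budget': ['model-d']}
```

Because nothing is hardcoded, a newly listed model only needs to show up in the candidate pool to compete for a slot.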

The daily benchmark runs at 07:30 with standardised tasks for each language (Norwegian, Swedish, Danish). A weekly report with trends and recommendations is published every Friday.

Focus

We focus on practical use in Scandinavia — not academic benchmarks. We test what ordinary people and businesses actually do with AI in everyday life. Results are updated daily, not once per quarter.

Change Log — What We’ve Discovered

Real observations from daily benchmarking. This is what quarterly reports miss.

2026-05-29 Magnum v4 72B entered the top list with a Norwegian score of 4/5, the highest Norwegian score among all models that day.

2026-05-28 GPT-4 (v0314) scored 0/10 — the outdated model was automatically replaced by the next candidate from the premium tier.

2026-05-28 inclusionAI: Ling-2.6-flash jumped from 4.3 → 7.2 overnight without notice from the provider — a silent update caught by daily testing.

2026-05-25 Llama 3.1 8B Instruct improved from 7.3 → 9.0 — a budget model with a sudden performance leap, now among the absolute best.

2026-05-25 Claude Opus 4.7 (Fast) went from 6.4 → 8.2 (+1.8) in one day, with no announcement from the provider: a silent update.

2026-05-21 Z.ai GLM 5.1 crashed from 6.5 → 1.2 (-5.3) — API instability from the provider. The model was flagged and a backup candidate activated.

2026-05-20 Z.ai GLM 5.1 appeared for the first time in the benchmark with a score of 6.5/10.

2026-05-18 AionLabs: Aion-1.0 scored 0/5 on Norwegian at debut — premium tier, but failed to handle Norwegian. Automatically replaced.

Why Daily Testing?

Most AI benchmarks are published monthly or quarterly. But AI models are updated continuously, often without the provider announcing it. A model that was best last week may have dropped to fifth place this week. Daily testing captures these changes within a day.

The AI market changes from day to day. We’ve caught several unannounced changes, which we call "silent updates", because a model’s score suddenly shifted between two daily runs. A quarterly report doesn’t capture this. Daily testing does.
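
Detecting a silent update comes down to comparing each model's score with the previous day's. The sketch below is illustrative only: the function name and the 1.5-point threshold are assumptions. The first two score pairs are taken from the 2026-05-25 change log entries, and the third model is a made-up stable case.

```python
def find_silent_updates(
    yesterday: dict[str, float],
    today: dict[str, float],
    threshold: float = 1.5,  # hypothetical cut-off, not a documented value
) -> list[tuple[str, float]]:
    """Return (model, change) pairs whose score moved by at least `threshold`."""
    flagged = []
    for model, new_score in today.items():
        old_score = yesterday.get(model)
        if old_score is None:
            continue  # first appearance, nothing to compare against
        change = round(new_score - old_score, 1)
        if abs(change) >= threshold:
            flagged.append((model, change))
    return flagged


yesterday = {"Claude Opus 4.7 (Fast)": 6.4, "Llama 3.1 8B Instruct": 7.3, "stable-model": 7.0}
today = {"Claude Opus 4.7 (Fast)": 8.2, "Llama 3.1 8B Instruct": 9.0, "stable-model": 7.1}
print(find_silent_updates(yesterday, today))
# [('Claude Opus 4.7 (Fast)', 1.8), ('Llama 3.1 8B Instruct', 1.7)]
```

A check like this only works with frequent runs. With quarterly data, the same jump would be indistinguishable from gradual drift.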

For businesses using AI in daily operations, this means your decisions are based on current results. You don’t need to wait three months for the next report to know whether you’re using the right model.