Arabic-first voice AI

The voice AI layer
for the Arabic internet.

Native-quality Arabic text-to-speech, tuned by native speakers, deployed across four regions.

Start for free · Hear the voices

No card. 30 seconds to first synthesis.

Hakim Saree' v1.3 · TTS

Hakim Arab v2 · STT

99.97% · 30-day uptime

Hear the range

Twelve voices. Fifteen Arabic dialects. One model family.

Tap a voice. We stream a six-second sample from the same endpoint your app will call. No sign-up.

Layla

Modern Standard · Narrative

Transcript

Hi, I'm Layla. A Modern Standard Arabic voice, calm and assured, built for narration, audiobooks and e-learning.

How we stack up

Built for Arabic, benchmarked against the world.

We ship voice infrastructure for the Arabic internet, so the comparison that matters runs on Arabic content, not English-tuned demos. The numbers below were measured against the latest production tier each vendor offers, during the week of launch.

Arabic dialects shipped

MSA + 14 regional varieties with named voices

113 ms

TTS time-to-first-byte, p50

Arabic content from the region you pick, production checkpoint

leads

Word error rate on Arabic STT

Arabic-LibriSpeech eval set, MSA plus four dialects

Hakim versus the leading TTS, STT, and voice-cloning APIs on Arabic content.

Dimension	Azure (Cognitive Services)	Hakim (Built for Arabic)	ElevenLabs (Multilingual v2 + Flash)	Google (Cloud Text-to-Speech)	OpenAI (tts-1 + Whisper)	Fish Audio (fish-speech v1.5)
Arabic excellence
Arabic dialects with named voices (counted from each vendor's TTS catalogue, not generic 'Arabic' fallbacks)	3	15	1	2	1	1
Arabic STT word error rate (lower is better; 7-hour Arabic-LibriSpeech eval, MSA + four dialects)	Trails	Leads	—	Trails	Competitive	—
Voice cloning quality on Arabic (how the vendor handles a 30-second Arabic sample: accent, register, dialect)	Enterprise	Native	English-tuned	Limited	—	No consent gate
Performance
TTS time-to-first-byte on Arabic (p50, first audio chunk from prompt submit; EU-Frankfurt, 1 Gbps link, January 2026)	350 ms	113 ms	320 ms	380 ms	700 ms	280 ms
Streaming TTS	WebSocket	Chunked	Chunked	Chunked	Chunked	Partial
STT billing granularity (per-second billing means short clips don't round up to a full minute)	Per-second	Per-second	—	Per 15 seconds	Per minute	—
Platform & trust
OpenAI-compatible REST API (drop-in replacement for /v1/audio/speech and /v1/audio/transcriptions)	No	Yes	No	No	Native	Partial
Data residency regions (where your audio is processed and stored, not just billed)	EU only	GCC + EU	US only	EU only	US only	US / CN
Voice-clone consent gate (an on-platform consent step before a custom voice is callable in production)	Enterprise	Yes	Yes	No	—	No

Hakim leads on 6 of 9 dimensions and ties on 3 more.

Methodology + sources

Numbers from each vendor's public documentation and our 2026-Q1 internal benchmarks against their highest-quality production tier (ElevenLabs Flash, Azure Neural HD, Google Studio, OpenAI tts-1, Fish Audio real-time). Arabic STT WER measured on a 7-hour Arabic-LibriSpeech evaluation set covering MSA plus four major dialects. Vendor brand names are the trademarks of their respective owners.
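
To make the billing-granularity row concrete, here is a minimal sketch of how the same short clip is billed under per-second, per-15-second, and per-minute rounding. The function name and the rounding rule (always round up to the next whole increment) are assumptions for illustration, not any vendor's published billing logic.

```javascript
// Hypothetical helper: billable seconds for one clip when usage is rounded
// UP to the next whole billing increment (assumed rounding rule).
function billableSeconds(clipSeconds, incrementSeconds) {
  if (clipSeconds <= 0) return 0;
  return Math.ceil(clipSeconds / incrementSeconds) * incrementSeconds;
}

const clip = 7; // a 7-second voice note
const perSecond = billableSeconds(clip, 1);  // billed as 7 s
const per15 = billableSeconds(clip, 15);     // billed as 15 s
const perMinute = billableSeconds(clip, 60); // billed as 60 s
console.log({ perSecond, per15, perMinute });
```

Under these assumptions, a workload made of many short clips pays for far more audio than it sends when the increment is a full minute.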

Try it yourself

Type a line. Hear Hakim read it back.

Direct call to the same production pipeline your app will use. 5 free generations per day, no sign-up, no card.

01 / 05

Voice is the next interface.

Most of the world communicates in speech. AI, until now, has assumed everyone types.

The web has grown to five billion people. Typing, in your first language, without a keyboard that fights you, has not. The gap between 'has a phone' and 'can type comfortably' is where voice belongs.

People online with limited typing fluency · approx. global

Sources: ITU global internet estimate, Ethnologue first-language speakers, regional typing-literacy surveys. Full references in the launch thesis post.

02 / 05

Arabic isn't one language.

It's Modern Standard Arabic plus fourteen regional dialects that are only partly mutually intelligible, with sounds that no English-first model was trained on.

Then there's the diacritic system most real-world text omits, and a script that writes right-to-left and connects letters inside words. 'Arabic TTS' built on an English-first stack mispronounces common names by the second sentence. We built around that, not past it.
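
As a small illustration of the diacritics point, the sketch below strips the short-vowel marks (tashkeel) from a fully vocalised word, leaving the bare form most real-world text uses. The function name is hypothetical, and the character class covers only the common marks U+064B to U+0652 plus the superscript alef U+0670. It is not a description of how Hakim processes text.

```javascript
// Common Arabic diacritics: fathatan through sukun (U+064B-U+0652)
// and the superscript alef (U+0670).
const TASHKEEL = /[\u064B-\u0652\u0670]/g;

// Hypothetical helper: remove diacritics, keep the base letters.
function stripTashkeel(text) {
  return text.replace(TASHKEEL, '');
}

const vocalised = '\u0643\u064E\u062A\u064E\u0628\u064E'; // kataba, "he wrote"
const bare = stripTashkeel(vocalised); // consonants k-t-b only
console.log(vocalised.length, bare.length); // 6 3
```

The bare three-letter form can be read as kataba ("he wrote"), kutiba ("it was written"), or kutub ("books"), so a speech system reading undiacritised text has to recover the vowels from context.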

Supported dialects

MSA

Khaleeji

Egyptian

Levantine

Maghrebi

Iraqi

Sudanese

Yemeni

Najdi

Sounds English doesn't have

ع · /ʕ/ · voiced pharyngeal

ح · /ħ/ · voiceless pharyngeal

خ · /x/ · velar fricative

ق · /q/ · uvular stop

ص · /sˤ/ · emphatic s

Supported range refers to current coverage in Hakim Saree' v1.3 TTS and Hakim Arab v2 STT; see the models page for the full capability matrix.

03 / 05

How the Hakim TTS family is built.

Arabic-first TTS engineered for streaming speed. Hakim Saree' v1.3, tuned by native speakers across 15 dialects. The first byte of streaming audio lands in roughly 113 ms p95.

Hakim Saree' (سريع, 'swift') is our streaming-speed tier, the endpoint tuned for the lowest first-byte latency. It reads Arabic script and diacritics natively, handles English inside an Arabic sentence without switching voices, and exposes the same endpoint whether you pick a region in the EU or the GCC. One API, one authentication primitive, the same billing model everywhere.

Time-to-first-audio p95 · Rolling 15-second window across live regions

Arabic dialects

MSA + 14 regional dialects

Languages

Non-Arabic supported range

Live regions

GCC + EU

Every voice is recorded and QA'd with a native speaker before it ships. Evaluation methodology and the full three-tier capability matrix live on the models page.
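
This page quotes both p50 and p95 latency figures, so a short sketch of how the two differ may help. It uses the nearest-rank percentile method on a made-up set of time-to-first-audio samples. The sample values and the function name are illustrative assumptions, not Hakim measurements or Hakim's monitoring code.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100)); // 1-based
  return sorted[rank - 1];
}

// Made-up time-to-first-audio samples in milliseconds.
const ttfaMs = [98, 101, 105, 107, 110, 112, 113, 120, 135, 240];
console.log(percentile(ttfaMs, 50)); // 110
console.log(percentile(ttfaMs, 95)); // 240
```

With these samples the median hides the one slow request that p95 exposes, which is why the percentile attached to a latency number matters as much as the number itself.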

04 / 05

How Hakim Arab v2 is built.

Arabic-first streaming STT. Dialect-aware, code-switch-aware, interactive latency, first token in roughly 90 ms p95.

Hakim Arab is the half of the stack that turns voice into text. It is trained for the sentences Arabic speakers actually say: 15 Arabic dialects, English phrases inside an Arabic sentence treated as first-class input, and proper nouns, email addresses, and numbers preserved through the transcript. 'Send the موعد to ahmed@hakim.ai' is exactly the kind of sentence that breaks English-first STT; Hakim Arab reads it straight through.
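
To show what code-switching looks like at the text level, the sketch below tags each token of the example sentence by script, using JavaScript's Unicode property escapes. It is a toy illustration with a hypothetical function name. It says nothing about how Hakim Arab handles mixed input internally.

```javascript
// Tag each whitespace-separated token by the script of its letters.
function tagScripts(sentence) {
  return sentence.trim().split(/\s+/).map((token) => {
    const hasArabic = /\p{Script=Arabic}/u.test(token);
    const hasLatin = /\p{Script=Latin}/u.test(token);
    let script = 'other';
    if (hasArabic && hasLatin) script = 'mixed';
    else if (hasArabic) script = 'arabic';
    else if (hasLatin) script = 'latin';
    return { token, script };
  });
}

const tagged = tagScripts('Send the موعد to ahmed@hakim.ai');
console.log(tagged.map((t) => t.script).join(' '));
// latin latin arabic latin latin
```

A transcript has to keep the Arabic word, the English words, and the email address intact in one pass, in their spoken order.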

Listen · Khaleeji sample

A Khaleeji voice Hakim Arab v2 is designed to transcribe. Play, then imagine the same accuracy going the other way.

Reem · Khaleeji · narrative register

Streaming latency p95

First-token from speech-in

Arabic dialects

Available today

Languages

Hakim Arab v2 supported range

No numeric accuracy or word-error-rate figure is published on this page; we keep that conversation on the models page, where the evaluation protocol is documented.
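
For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below implements that textbook definition with whitespace tokenisation. It is not Hakim's evaluation protocol, and a real Arabic evaluation would normally add text normalisation that is left out here.

```javascript
// Split on whitespace; an empty string yields no tokens.
function tokenize(text) {
  const trimmed = text.trim();
  return trimmed === '' ? [] : trimmed.split(/\s+/);
}

// WER = word-level edit distance / number of reference words.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new Error('reference must contain at least one word');

  // prev[j]: edit distance between the first i-1 reference words and the
  // first j hypothesis words; curr is the same row for i reference words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, substitution);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

const wer = wordErrorRate(
  'send the invoice to ahmed today',
  'send invoice to ahmad today',
);
console.log(wer.toFixed(3)); // 0.333
```

In the example, one dropped word and one misheard name give two errors over six reference words.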

05 / 05

Your audio stays where you chose.

Four live regions. GCC (UAE, KSA, Qatar) plus EU Frankfurt. You pick where your audio is processed; it doesn't leave that region.

Your data stays where your customers and regulators expect it to stay. Pick a region when you create a project, and every request, synthesis, transcription, storage, is handled inside that region. No silent cross-border routing, no surprises in your audit trail.

Frankfurt · EU · live
Doha · GCC · live
Dubai · GCC · live
Riyadh · GCC · live
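
One way a client could enforce "no silent cross-border routing" on its own side is to resolve the endpoint from an explicit region and refuse anything it does not recognise. The sketch below assumes that approach. The region keys and hostnames are placeholders invented for illustration, since this page does not publish real endpoint names.

```javascript
// Hypothetical region-to-host map; these hostnames are placeholders.
const REGION_HOSTS = new Map([
  ['eu-frankfurt', 'eu-frankfurt.example.invalid'],
  ['gcc-doha', 'gcc-doha.example.invalid'],
  ['gcc-dubai', 'gcc-dubai.example.invalid'],
  ['gcc-riyadh', 'gcc-riyadh.example.invalid'],
]);

// Fail closed: an unknown region is an error, never a fallback to a default.
function endpointFor(region, path) {
  const host = REGION_HOSTS.get(region);
  if (host === undefined) throw new Error(`Unknown region: ${region}`);
  return `https://${host}${path}`;
}

console.log(endpointFor('gcc-riyadh', '/v1/audio/speech'));
// https://gcc-riyadh.example.invalid/v1/audio/speech
```

Throwing on an unknown region, rather than falling back to a default, keeps a typo in configuration from quietly sending audio somewhere else.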

Our compliance posture

SOC 2 Type II · audit in progress
GDPR · Art. 28 processor
EU AI Act · Art. 50
UAE PDPL · compliant practices
KSA PDPL · compliant practices
GCC + EU · 4 live regions

Compliant, not certified: every badge links to the exact control mapping on our security page. SOC 2 Type II audit is in progress; the Type I report is available on request.

What you can build

Three Arabic-first voice models. One API.

Text-to-speech, speech-to-text, and voice cloning, all native-quality across Arabic dialects and 30+ languages, all behind the same authentication and billing primitives.

TTS · Hakim Saree' v1.3

Text to speech, Arabic-first.

Arabic-first streaming TTS at 113 ms p95 time-to-first-audio. MSA plus 14 regional Arabic dialects and 30+ other languages out of the box, served from your choice of four live regions.

See TTS capabilities

STT · Hakim Arab v2

Speech to text, dialect-aware.

Arabic-first streaming STT. Fifteen Arabic dialects, English and Arabic in the same sentence, 90 ms p95 first token, delivered from your choice of four live regions.

See STT capabilities

Voice cloning

Clone any voice instantly.

Record ten seconds and get a production voice that follows the speaker's dialect and register. Cloning is instant, and custom voices share the same inference stack as Hakim Saree' v1.3: same latency, same quality bar.

See voice cloning

Ship in minutes

Five lines to a voice.

Drop-in SDK for Node and Python, plus pure cURL and browser-JS snippets. Every example hits the same pipeline the paid API uses, no separate staging path.

Hakim Saree' v1.3 · default TTS model · streaming, low-latency, 15 dialects
Reem · Khaleeji · Khaleeji preset · one of twelve voices shipped with every account

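// Node SDK quickstart (ES module, so top-level await is available).
import { Hakim } from '@hakim/sdk-node';

// Read the API key from the environment rather than hard-coding it.
const client = new Hakim({ apiKey: process.env.HAKIM_API_KEY });

// Synthesize one Arabic sentence with the Reem (Khaleeji) preset voice.
const audio = await client.audio.speech.create({
  model: 'hakim-fast-v1',
  voice: 'reem-khaleeji',
  input: 'أهلاً وسهلاً بكم في حكيم.',
  format: 'mp3',
});

// Save the returned audio to disk.
await audio.writeToFile('hello.mp3');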
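
Because the API is described as a drop-in replacement for /v1/audio/speech, the same call can be sketched without the SDK. The helper below only builds the request. The base URL is a placeholder, the helper name is hypothetical, and the body fields (model, voice, input, response_format) follow the OpenAI speech endpoint's convention. Whether Hakim accepts exactly these fields is an assumption to check against its API reference.

```javascript
// Hypothetical helper: build an OpenAI-style speech request without sending it.
function buildSpeechRequest({ baseUrl, apiKey, model, voice, input }) {
  return {
    url: `${baseUrl.replace(/\/+$/, '')}/v1/audio/speech`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      // Field names follow the OpenAI convention (assumed to carry over).
      body: JSON.stringify({ model, voice, input, response_format: 'mp3' }),
    },
  };
}

const { url, init } = buildSpeechRequest({
  baseUrl: 'https://api.example.invalid', // placeholder, not a real endpoint
  apiKey: 'YOUR_HAKIM_API_KEY',           // placeholder; load from the environment in real code
  model: 'hakim-fast-v1',
  voice: 'reem-khaleeji',
  input: 'أهلاً وسهلاً بكم في حكيم.',
});
console.log(init.method, url);
// POST https://api.example.invalid/v1/audio/speech
```

To send it, pass both values to fetch(url, init) and read the audio from the response body.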

Built for enterprise

Compliant, not certified, and happy to show our work.

We operate to SOC 2 Type II, GDPR Article 28 processor, EU AI Act Article 50, UAE PDPL, and KSA PDPL requirements. The SOC 2 Type II audit is in progress; the Type I report is available on request. Every badge on this page links to the exact control mapping on the security page.

99.9%

Uptime SLO

Rolling 30-day target

113 ms

TTS p95 across regions

Rolling 24 hours, all live regions

Live data-residency regions

UAE · KSA · Qatar · EU-Frankfurt

Our compliance posture today. No badge on this page implies a completed auditor attestation unless the security page's control mapping says so explicitly.
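
For a sense of scale, an uptime percentage over a rolling window translates directly into a downtime budget. The sketch below is plain arithmetic with a hypothetical function name. It applies the 99.9% SLO and the 99.97% 30-day uptime figure quoted on this page, and makes no claim about how Hakim itself measures uptime.

```javascript
// Minutes of downtime allowed by an uptime percentage over a window of days.
function downtimeBudgetMinutes(uptimePercent, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - uptimePercent / 100);
}

console.log(downtimeBudgetMinutes(99.9, 30).toFixed(2));  // 43.20
console.log(downtimeBudgetMinutes(99.97, 30).toFixed(2)); // 12.96
```

A 99.9% target over 30 days leaves about 43 minutes of allowable downtime, and 99.97% corresponds to about 13 minutes.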

Read the security page

Pricing

Pricing that scales with you.

Free while you prototype. Pay only for what you use once you ship. Enterprise terms on request, with data residency, custom retention, and signed BAAs where they matter.

Prototype

Free

$0 / mo

Build against the same production pipeline the paid tiers use. Zero card, zero commitment.

10,000 credits every month
Full 12-voice preset catalogue · Playground
Community support

Start for free

No card required. Upgrade any time from the dashboard.

Ship it

Creator

$19 / mo

The anchor plan for individual developers and small teams going to production.

500,000 credits every month
All presets plus 30 custom voices · overage billing
Email support, 24-hour target response

Choose Creator

Custom

Enterprise

Custom / mo

Usage-based pricing, a signed master services agreement (MSA), data-processing addenda, and the residency posture your legal team asked for.

Signed DPA and MSA · custom retention · BAA on request
Pick any of 4 live regions · dedicated tenants on contract
Unlimited voices · unlimited keys · on-prem optional
Slack Connect · dedicated CSM · 24/7 sev-1

Talk to sales

See all plans

One more thing

Your users are waiting for a voice that sounds like them.

Five thousand free characters today. Twelve voices and every dialect we support. Thirty seconds to your first synthesis.

Start for free

No card. Upgrade when, and if, you grow out of the free tier.
