AISA-ArabicFC

The Challenge

Teach machines to act in Arabic.

علّم الآلات أن تتصرف بالعربية.

Function calling — the bridge between language models and the real world — has exploded in English. Models can book flights, query databases, and chain tools into autonomous agents. For Arabic, with its rich dialectal variation and morphological complexity, this capability barely exists.

AISA-ArabicFC closes that gap. Given an Arabic query in any of five dialects, your system must decide whether a tool call is needed, choose the right function from a candidate set, and extract structured arguments — optionally producing an Arabic reasoning trace.

A user in Cairo says "عايز أحجز دكتور" ("I want to book a doctor"). A user in Riyadh says "أبي أحجز موعد عند الدكتور" ("I want to book an appointment with the doctor"). Same intent. Same tool. Your model has to know.

12,125 total samples · 5 Arabic dialects · 8 service domains · structured tools

A unified reference architecture for agentic AI systems.

بنية مرجعية موحّدة لأنظمة الذكاء الاصطناعي التوكيلي.

A reference architecture proposed by Tuwaiq Academy — the foundation behind AISA-ArabicFC.

Design · Evaluation · Governance · Deployment


01 · CONCEPT

What is AISA?

AISA is an implementation-neutral reference architecture for building agentic AI systems. It formalizes how reasoning, execution, infrastructure, evaluation, and governance interact to produce reliable, scalable, and auditable agent behavior.

02 · STRUCTURE

Architecture Overview

AISA defines six interacting layers — intelligence, cognition, tooling, evaluation, deployment, and governance — that together describe how a complete agentic system is designed, run, and audited.

Cite: Nacar, O., Deema, A., & Mohammed, A. (2026). AISA: A Unified Architecture for Agentic AI Systems. Zenodo. https://doi.org/10.5281/zenodo.18161880

Three Tracks

Pick your challenge.

اختر التحدي الذي يناسبك.

Submit to Track A, Track B, or both. Every submission is automatically scored on Track C for dialect robustness.

TRACK A · CORE

Function Call Detection & Selection

The flagship track. Given an Arabic query and candidate tools, decide whether a call is needed, select the function, and extract its arguments.

Binary call decision
Function name selection
Argument extraction (JSON)
Open to all model architectures
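
The three Track A steps can be pictured as a single prediction record. Below is a minimal Python sketch, assuming a hypothetical candidate-tool schema and output layout: the field names (`call`, `function`, `arguments`, `parameters`), the parameter lists, and the `make_prediction` helper are all illustrative and not the official submission format.

```python
import json
from typing import Optional

# Hypothetical candidate set; the real tool schemas ship with the dataset.
CANDIDATE_TOOLS = [
    {"name": "book_doctor_appointment", "parameters": ["specialty", "date"]},
    {"name": "get_weather", "parameters": ["city"]},
]

def make_prediction(function: Optional[str], arguments: Optional[dict]) -> str:
    """Serialize one decision; function=None means no tool call is needed."""
    if function is None:
        return json.dumps({"call": False, "function": "none", "arguments": {}})
    known = {tool["name"]: tool for tool in CANDIDATE_TOOLS}
    if function not in known:
        raise ValueError(f"{function!r} is not in the candidate set")
    unexpected = set(arguments or {}) - set(known[function]["parameters"])
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    # ensure_ascii=False keeps Arabic argument values readable in the output.
    return json.dumps(
        {"call": True, "function": function, "arguments": arguments or {}},
        ensure_ascii=False,
    )
```

Rejecting functions outside the candidate set at serialization time is one way to catch hallucinated tool names before they reach a scorer.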

TRACK B · REASONING

Reasoning-Augmented Calling

Everything in Track A, plus an Arabic reasoning trace inside <think> blocks before the tool call.

All Track A requirements
Arabic reasoning generation
Reasoning-action consistency scored
For interpretable agentic AI

TRACK C · DIAGNOSTIC

Cross-Dialect Robustness

Not a separate submission — an automatic dialect-stratified breakdown of your Track A or B results.

MSA · Gulf · Egyptian · Levantine · Maghrebi
Per-dialect FnAcc + ArgEM
Dialect gap score (max−min)
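
The max−min gap can be sketched in a few lines of Python. This is an illustration only: the `(dialect, is_correct)` record layout and the helper names are assumptions, and the toy records are invented rather than taken from the dataset.

```python
from collections import defaultdict

def per_dialect_accuracy(records):
    """records: iterable of (dialect, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for dialect, is_correct in records:
        totals[dialect] += 1
        hits[dialect] += int(is_correct)
    return {dialect: hits[dialect] / totals[dialect] for dialect in totals}

def dialect_gap(scores):
    """Best-scoring dialect minus worst-scoring dialect."""
    return max(scores.values()) - min(scores.values())

# Toy records, invented for illustration only.
toy = [("MSA", True), ("MSA", True), ("Gulf", True), ("Gulf", False)]
scores = per_dialect_accuracy(toy)
gap = dialect_gap(scores)
```

A gap of zero would mean the system performs identically on every dialect; larger values indicate less robust behavior.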

The Dataset

12,125 Arabic queries. Real domains. Real dialects.

١٢٬١٢٥ استعلامًا عربيًا · مجالات حقيقية · لهجات حقيقية.

Hosted on Hugging Face at TuwaiqAcademy/AISA-ArabicFC. Training and development splits available now — blind test set released July 20, 2026.

Split	Samples	Notes
Train	10,550	Available now
Dev	525	Available now
Test (blind)	1,050	Released July 20
Positive samples	12,000	Tool call required
Negative samples	125	No-call cases
Reasoning traces	12,000	For Track B

Load it in 3 lines

# pip install datasets
from datasets import load_dataset

ds = load_dataset("TuwaiqAcademy/AISA-ArabicFC")
print(ds["train"][0])

Dialect breakdown

Dialect	%
Modern Standard Arabic (MSA)	58.3%
Levantine	16.9%
Egyptian	12.2%
Gulf	11.3%
Maghrebi	1.3%
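
To get a feel for how many samples each dialect contributes, the percentages can be turned into approximate counts. The sketch below assumes the shares apply to the full 12,125 samples, which the page does not state explicitly; the per-split distribution may differ.

```python
TOTAL_SAMPLES = 12_125  # total reported for the dataset

# Shares from the dialect breakdown table, in percent.
DIALECT_SHARE = {
    "MSA": 58.3,
    "Levantine": 16.9,
    "Egyptian": 12.2,
    "Gulf": 11.3,
    "Maghrebi": 1.3,
}

def approx_counts(total, shares):
    """Rounded sample counts implied by percentage shares (an estimate)."""
    return {dialect: round(total * pct / 100) for dialect, pct in shares.items()}

counts = approx_counts(TOTAL_SAMPLES, DIALECT_SHARE)
for dialect, n in counts.items():
    print(f"{dialect}: ~{n}")
```

Under that assumption Maghrebi comes to roughly 158 samples, so its per-dialect scores are likely to be noisier than those for MSA.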

Eight service domains

Real-world Arabic services. Tools designed by domain experts.

🏥 Healthcare

book_doctor_appointment · search_medications · check_insurance_coverage

🏦 Banking & Finance

transfer_money · convert_currency · calculate_customs

🏛️ Government Services

check_visa_status · check_iqama_status · check_traffic_violations

🕌 Islamic Services

get_qibla_direction · calculate_zakat · search_quran · calculate_inheritance

✈️ Travel

search_hotels · search_umrah_packages

🌤️ Weather & Environment

get_weather · get_air_quality

🛒 E-commerce

compare_prices · order_food

🔧 Utilities

translate_text · calculate_end_of_service

Evaluation

How submissions are scored.

كيفية تقييم المشاركات.

Two core metrics (FnAcc and ArgEM), plus ThinkRate for Track B, combined into one weighted composite per track. Argument extraction is weighted higher because it's the primary challenge. The official evaluation script will be released open-source.

Track A

Overall = 0.40 · FnAcc + 0.60 · ArgEM

Track B

Overall = 0.30 · FnAcc + 0.50 · ArgEM + 0.20 · ThinkRate
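
The two composites are plain weighted sums. A minimal Python sketch follows; the function names are illustrative, and the official scorer may round or aggregate differently.

```python
def overall_track_a(fn_acc: float, arg_em: float) -> float:
    """Track A composite: 0.40 * FnAcc + 0.60 * ArgEM."""
    return 0.40 * fn_acc + 0.60 * arg_em

def overall_track_b(fn_acc: float, arg_em: float, think_rate: float) -> float:
    """Track B composite: 0.30 * FnAcc + 0.50 * ArgEM + 0.20 * ThinkRate."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate

# Pilot figures reported on this page for AISA-Think.
print(round(overall_track_a(0.982, 0.541), 3))
print(round(overall_track_b(0.982, 0.541, 0.868), 3))
```

With the pilot figures reported for AISA-Think (FnAcc 0.982, ArgEM 0.541, ThinkRate 0.868) this gives 0.717 for Track A and 0.739 for Track B, matching the pilot results table.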

Weight: 0.60 (Track A) / 0.50 (Track B)

ArgEM — Argument Exact Match ★

Strict match of all predicted argument key-value pairs. The headline metric: in the pilot, the fine-tuned 270M model reaches 0.541, while GPT-4o scores 0.070 zero-shot and 0.122 with 3-shot prompting. Massive room to improve.
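
One way to read "strict match" is plain dictionary equality, where a single wrong key, value, or type fails the whole example. The sketch below is an assumption about how such a metric could be computed, not the official scorer; in particular, how no-call samples are counted is not specified here.

```python
def arg_exact_match(predicted: dict, gold: dict) -> bool:
    """True only if every key and value matches; extra or missing keys fail."""
    return predicted == gold

def arg_em(predictions, golds) -> float:
    """Fraction of examples whose argument dicts match exactly."""
    pairs = list(zip(predictions, golds))
    return sum(arg_exact_match(p, g) for p, g in pairs) / len(pairs)

# Toy examples, invented for illustration only.
golds = [{"city": "جدة"}, {"amount": 500, "currency": "SAR"}]
preds = [{"city": "جدة"}, {"amount": "500", "currency": "SAR"}]
score = arg_em(preds, golds)
```

In the toy data the second prediction fails only because `"500"` is a string while the gold value is an integer, which is the kind of numeric type slip that makes exact match hard.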

Weight: 0.40 (Track A) / 0.30 (Track B)

FnAcc — Function Name Accuracy

Exact match of the predicted function name (or "none" for negatives — this also penalises hallucinated calls).
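
A small Python sketch shows how treating negatives as the label "none" penalises hallucinated calls. The helper names and the convention of mapping a missing prediction to "none" are assumptions for illustration.

```python
from typing import Optional

def normalize_name(name: Optional[str]) -> str:
    """Map a missing or empty prediction to the 'none' label."""
    return name.strip() if name and name.strip() else "none"

def fn_acc(predicted_names, gold_names) -> float:
    """Fraction of examples where the predicted function name is exact."""
    pairs = list(zip(predicted_names, gold_names))
    correct = sum(normalize_name(p) == normalize_name(g) for p, g in pairs)
    return correct / len(pairs)

# Toy examples: the second gold label is a no-call case.
gold = ["get_weather", "none", "order_food"]
pred = ["get_weather", "get_weather", "order_food"]
score = fn_acc(pred, gold)
```

Here the system called `get_weather` on a query that needed no tool, so it scores 2 out of 3.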

Weight: 0.20 (Track B)

ThinkRate — Track B only

Did the system produce an Arabic reasoning trace before the tool call? Best baseline: 0.868.
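
A rough Python sketch of such a check follows. It assumes the raw model output starts with a `<think>…</think>` block and that "Arabic" can be approximated by the presence of at least one character in the basic Arabic Unicode block; the official definition may be stricter.

```python
import re

# The trace must open the output, i.e. come before any tool call.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>", re.DOTALL)
# Basic Arabic Unicode block; a heuristic, not a language identifier.
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")

def has_arabic_trace(output: str) -> bool:
    """True if the output begins with a <think> block containing Arabic text."""
    match = THINK_RE.match(output)
    return bool(match and ARABIC_RE.search(match.group(1)))

def think_rate(outputs) -> float:
    """Fraction of outputs that reason in Arabic before the tool call."""
    return sum(has_arabic_trace(o) for o in outputs) / len(outputs)
```

Anchoring the pattern at the start of the output means a trace emitted after the tool call would not count.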

Diagnostic

Dialect gap (Track C)

Per-dialect FnAcc + ArgEM, plus the max−min gap. Reported for analysis, not ranking. Gulf and Levantine are the hardest.

Pilot Results

What the field looks like today.

واقع الأداء في المجال اليوم.

Four systems evaluated on the held-out test set. The takeaway: function selection is approachable. Argument exact match is wide open.

System	Setup	FnAcc	ArgEM ★	Overall (A)	Overall (B)
AISA-Think	Gemma 3 (270M) + LoRA · reasoning-augmented	0.982	0.541	0.717	0.739
GPT-4o	Zero-shot prompting	0.927	0.070	0.413	0.313
GPT-4o	3-shot prompting	0.854	0.122	0.415	0.317
Random	Random tool from candidate set	0.047	0.033	0.039	0.031

ThinkRate (Track B think-before-call rate): AISA-Think 0.868, all others 0.000.

🎯

Argument extraction is the core challenge

Best ArgEM is 0.541; GPT-4o achieves 0.070 zero-shot and 0.122 with 3-shot prompting. Hard cases include date format normalization, numeric type handling, and dialectal argument phrasing.
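
Two of these hard cases can be illustrated with small normalization helpers. The sketch below assumes ISO 8601 dates and native int/float values as the canonical forms, and a short list of accepted input date formats; the dataset's actual conventions should be checked before relying on either.

```python
from datetime import datetime

# Map Arabic-Indic digits to ASCII digits.
ARABIC_INDIC = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def normalize_number(text: str):
    """Parse '١٬٥٠٠' or '1,500' into an int (or a float if fractional)."""
    cleaned = text.translate(ARABIC_INDIC).replace(",", "").replace("٬", "").strip()
    value = float(cleaned)
    return int(value) if value.is_integer() else value

def normalize_date(text: str) -> str:
    """Convert a few common date layouts to YYYY-MM-DD (assumed target)."""
    cleaned = text.translate(ARABIC_INDIC).strip()
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")
```

Applying helpers like these to model output before scoring is one possible way to avoid losing exact-match credit on formatting alone; dialectal phrasing of arguments is a separate problem that string normalization does not address.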

⚡

Small fine-tuned models beat GPT-4o

A 270M LoRA model outperforms GPT-4o across every metric — showing the value of task-specific Arabic training over scale alone.

🌍

Dialect gaps are significant

FnAcc varies by up to 17.8pp across dialects. Gulf and Levantine Arabic are consistently the hardest — Track C surfaces this directly.

Prizes

$1,000 in total prizes for the top three systems.

جوائز نقدية للأنظمة الثلاثة الأولى.

Cash prizes awarded to the best-performing systems on the final test set leaderboard. Open to every registered team worldwide — no fees, no nationality requirements.

🥇 1st Place · Winner: $500 cash prize
🥈 2nd Place: $300 cash prize
🥉 3rd Place: $200 cash prize

$1,000 total

Ranking is based on the Track A Overall score on the blind test set (released July 20). Winners announced July 30 and recognised at ArabicNLP 2026 in Budapest.

How to Participate

From registration to leaderboard to paper.

من التسجيل إلى لوحة الصدارة إلى الورقة العلمية.

The shared task runs in four clear phases. Registration opens first, then the data drops, then teams iterate on the live dev leaderboard, then the blind test settles the final ranking.

Register your team

Registration is open now · sign up your team with its name, leader, registered email, and Hugging Face account. Registering lets you submit to the leaderboard from day one. Closes July 20.

Get data, baselines & eval

Released June 1 · we drop the training + dev splits on Hugging Face, the official evaluation script, and the baseline model code. Pick any approach — fine-tuned LLMs, prompting, retrieval, hybrid — and get building.

Climb the dev leaderboard

June 1 → July 20 · iterate freely. Upload predictions to the live leaderboard, get auto-scored on the dev split, watch your rank update. Submit as many system variants as you want — the dev board is your sandbox.

Test set & paper

July 20 → Aug 22 · the blind test set drops July 20. Run your best system, submit final predictions. Results published July 30. Camera-ready system description paper due August 22.

📝

Registration is open — sign up your team now

Open the official Tuwaiq Academy registration form: tuwaiq.edu.sa/form/rL7Bl3wq. Closes July 20.

Required citations for system description papers

All participating system description papers must cite the three references below. See the Citation section for ready-to-use BibTeX entries.

Shared task: Najar et al. (2026) — AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems
Architecture: Nacar, Deema, & Mohammed (2026) — AISA: A Unified Architecture for Agentic AI Systems
Methodology: Nacar et al. (2026) — From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning

Timeline

Important dates.

تواريخ مهمّة.

All deadlines are Anywhere on Earth (AoE, UTC−12).

May 16, 2026

Task launch

Shared task website live · registration opens · leaderboard online

June 1, 2026

Training / development data · baseline code · evaluation scripts released

Hugging Face dataset goes public · baseline model code published · official scorer released open-source

July 20, 2026

Registration deadline · Blind test data released

Last day to register your team on Hugging Face. Test set distributed to registered teams.

July 30, 2026

Final results released

Leaderboard published. Per-track and per-dialect breakdowns.

August 22, 2026

Camera-ready system description papers due

Participants submit final system papers describing their approach.

September 1, 2026

Shared task overview paper due

Organizers' overview paper covering task, methodology, and results.

September 10, 2026

Conference camera-ready deadline

Final paper revisions due for the proceedings.

October 24–29, 2026

ArabicNLP 2026 / EMNLP 2026 — Budapest, Hungary

Presentation at the Fourth Arabic Natural Language Processing Conference, co-located with EMNLP 2026.

Required Citations

Cite these three works.

يُرجى الاستشهاد بهذه الأعمال الثلاثة.

All system description papers must cite the shared task, the AISA architecture, and the methodology paper.

01 · Shared Task

AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems

Najar, Al Khalifa & Alzaharani · ArabicNLP 2026 · Budapest

@inproceedings{najar2026aisaarabicfc,
  title     = {{AISA-ArabicFC}: Arabic Function Calling for Agentic AI Systems},
  author    = {Najar, Omar and Al Khalifa, Mohammed and Alzaharani, Saeed},
  booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
  year      = {2026},
  address   = {Budapest, Hungary},
  publisher = {Association for Computational Linguistics}
}

02 · Architecture

AISA: A Unified Architecture for Agentic AI Systems

Nacar, Deema & Mohammed · Zenodo 2026 · 10.5281/zenodo.18161880

@misc{nacar2026aisa,
  title     = {{AISA}: A Unified Architecture for Agentic AI Systems},
  author    = {Nacar, Omer and Deema, A. and Mohammed, A.},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18161880},
  url       = {https://doi.org/10.5281/zenodo.18161880}
}

03 · Methodology

From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning

Nacar, Alquffari, Alsharideh, AlOtaibi et al. · arXiv:2603.16901 · 2026

@article{nacar2026language,
  title   = {From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning},
  author  = {Nacar, Omer and Alquffari, Deema and Alsharideh, Saleh and AlOtaibi, Adeem and Alabdulkarim, Abdulaziz and Alhazmi, Leen and Alomar, Nada and Alzubaidi, Wareef and Alsultan, Nada and Alrabghi, Ahmed and others},
  journal = {arXiv preprint arXiv:2603.16901},
  year    = {2026}
}

Organized by Tuwaiq Academy.