Arabic Function Calling for Agentic AI Systems — the first shared task benchmarking tool-use in Arabic across five dialects, eight real-world domains, and 27 structured tools.
Function calling — the bridge between language models and the real world — has exploded in English. Models can book flights, query databases, and chain tools into autonomous agents. For Arabic, with its rich dialectal variation and morphological complexity, this capability barely exists.
AISA-ArabicFC closes that gap. Given an Arabic query in any of five dialects, your system must decide whether a tool call is needed, choose the right function from a candidate set, and extract structured arguments — optionally producing an Arabic reasoning trace.
A user in Cairo says "عايز أحجز دكتور" ("I want to book a doctor"). A user in Riyadh says "أبي أحجز موعد عند الدكتور" ("I want to book an appointment with the doctor"). Same intent. Same tool. Your model has to know.
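As a rough illustration of the three decisions (call or not, which function, which arguments), a prediction can be modelled as a small record. This is a sketch, not the official submission format: the tool name `book_appointment`, the `specialty` argument, and the use of `None` for no-call cases are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """A structured prediction: the chosen tool and its extracted arguments."""
    name: Optional[str]  # None stands for "no tool call needed" (assumption)
    arguments: dict[str, Any] = field(default_factory=dict)


def needs_call(pred: ToolCall) -> bool:
    """The 'decide' step: did the system choose to call a tool at all?"""
    return pred.name is not None


# Hypothetical tool name and argument key, for illustration only.
predictions = {
    "عايز أحجز دكتور": ToolCall("book_appointment", {"specialty": "doctor"}),
    "أبي أحجز موعد عند الدكتور": ToolCall("book_appointment", {"specialty": "doctor"}),
}
egyptian, gulf = predictions.values()
same_intent = egyptian == gulf  # dataclass equality compares name and arguments
```

Both dialectal phrasings map to the same record here, which is the behaviour the task asks a system to reproduce.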
A reference architecture proposed by Tuwaiq Academy — the foundation behind AISA-ArabicFC.
AISA is an implementation-neutral reference architecture for building agentic AI systems. It formalizes how reasoning, execution, infrastructure, evaluation, and governance interact to produce reliable, scalable, and auditable agent behavior.
AISA defines six interacting layers — intelligence, cognition, tooling, evaluation, deployment, and governance — that together describe how a complete agentic system is designed, run, and audited.
Submit to Track A, Track B, or both. Every submission is automatically scored on Track C for dialect robustness.
Track A: The flagship track. Given an Arabic query and candidate tools, decide whether a call is needed, select the function, and extract its arguments.
Track B: Everything in Track A, plus an Arabic reasoning trace inside <think> blocks before the tool call.
Track C: Not a separate submission, but an automatic dialect-stratified breakdown of your Track A or B results.
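For the reasoning-trace track, a system's raw output has to be split into the trace and the tool call before scoring. The sketch below assumes the tool call is a single JSON object that follows the `<think>` block; that layout, the function name `split_trace`, and the example tool name are assumptions, not the official format.

```python
import json
import re
from typing import Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_trace(output: str) -> Tuple[Optional[str], Optional[dict]]:
    """Return (reasoning, tool_call) parsed from raw model output.

    Assumes the tool call is one JSON object placed after the <think> block.
    Either element is None when it is missing or cannot be parsed.
    """
    match = THINK_RE.search(output)
    reasoning = match.group(1).strip() if match else None
    rest = output[match.end():] if match else output
    try:
        call = json.loads(rest.strip())
    except json.JSONDecodeError:
        call = None
    if not isinstance(call, dict):
        call = None
    return reasoning, call


# Hypothetical model output; the tool name is illustrative only.
raw = (
    "<think>المستخدم يريد حجز موعد عند الطبيب.</think>\n"
    '{"name": "book_appointment", "arguments": {"specialty": "doctor"}}'
)
reasoning, call = split_trace(raw)
```

Returning `None` instead of raising keeps malformed outputs scoreable as plain misses, which is one reasonable design choice among several.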
Hosted on Hugging Face at TuwaiqAcademy/AISA-ArabicFC. Training and development splits available now — blind test set released July 20, 2026.
| Split | Samples | Notes |
|---|---|---|
| Train | 10,550 | Available now |
| Dev | 525 | Available now |
| Test (blind) | 1,050 | Released July 20 |
| Positive samples | 12,000 | Tool call required |
| Negative samples | 125 | No-call cases |
| Reasoning traces | 12,000 | For Track B |
| Dialect | % |
|---|---|
| Modern Standard Arabic (MSA) | 58.3% |
| Levantine | 16.9% |
| Egyptian | 12.2% |
| Gulf | 11.3% |
| Maghrebi | 1.3% |
Real-world Arabic services. Tools designed by domain experts.
Two metrics, one weighted composite per track. Argument extraction is weighted higher because it is the primary challenge. The official evaluation script will be released as open source.
For negative samples the expected prediction is "none", which also penalises hallucinated calls.
Four systems were evaluated on the held-out test set. The takeaway: function selection is approachable, while argument exact match is wide open.
| System | FnAcc | ArgEM ★ | Overall (A) | Overall (B) |
|---|---|---|---|---|
| AISA-Think (Gemma 3 (270M) + LoRA, reasoning-augmented) | 0.982 | 0.541 | 0.717 | 0.739 |
| GPT-4o (zero-shot prompting) | 0.927 | 0.070 | 0.413 | 0.313 |
| GPT-4o (3-shot prompting) | 0.854 | 0.122 | 0.415 | 0.317 |
| Random (random tool from candidate set) | 0.047 | 0.033 | 0.039 | 0.031 |
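The page does not state the composite weights, but the Overall (A) column in the results table is consistent with a 0.4 / 0.6 split between FnAcc and ArgEM. The check below treats those weights as an inference from the reported numbers, not as the official formula; Overall (B) is left out because its extra component is not described here.

```python
# (FnAcc, ArgEM, reported Overall (A)) copied from the results table.
RESULTS = {
    "AISA-Think": (0.982, 0.541, 0.717),
    "GPT-4o zero-shot": (0.927, 0.070, 0.413),
    "GPT-4o 3-shot": (0.854, 0.122, 0.415),
    "Random": (0.047, 0.033, 0.039),
}

# Assumed weights, inferred from the table; not an official specification.
W_FN, W_ARG = 0.4, 0.6


def overall_a(fn_acc: float, arg_em: float) -> float:
    """Weighted composite of function accuracy and argument exact match."""
    return W_FN * fn_acc + W_ARG * arg_em


for system, (fn_acc, arg_em, reported) in RESULTS.items():
    computed = overall_a(fn_acc, arg_em)
    print(f"{system}: computed {computed:.3f}, reported {reported:.3f}")
```

Under these assumed weights, each computed value agrees with the reported one to three decimal places.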
Best ArgEM is 0.541; GPT-4o reaches only 0.070 zero-shot and 0.122 with 3-shot prompting. Hard cases include date format normalization, numeric type handling, and dialectal argument phrasing.
A 270M-parameter model fine-tuned with LoRA outperforms both GPT-4o configurations on every metric, showing the value of task-specific Arabic training over scale alone.
FnAcc varies by up to 17.8 percentage points across dialects. Gulf and Levantine Arabic are consistently the hardest, and Track C surfaces this directly.
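To make the dialect breakdown and the strictness of exact match concrete, here is a minimal scoring sketch. It is not the official evaluation script: the record fields (`dialect`, `name`, `arguments`), the tool name, and the rule that ArgEM only counts when the function is also correct are all assumptions.

```python
from collections import defaultdict


def score(gold: list[dict], pred: list[dict]) -> dict[str, dict[str, float]]:
    """Per-dialect FnAcc and ArgEM, using strict equality on arguments."""
    totals = defaultdict(lambda: {"n": 0, "fn": 0, "arg": 0})
    for g, p in zip(gold, pred):
        t = totals[g["dialect"]]
        t["n"] += 1
        fn_ok = p["name"] == g["name"]  # negatives are correct only if "none"
        t["fn"] += int(fn_ok)
        # Strict match: "30" != 30 and "20/07/2026" != "2026-07-20".
        t["arg"] += int(fn_ok and p["arguments"] == g["arguments"])
    return {
        d: {"FnAcc": t["fn"] / t["n"], "ArgEM": t["arg"] / t["n"]}
        for d, t in totals.items()
    }


# Toy data with hypothetical tool and argument names.
gold = [
    {"dialect": "Egyptian", "name": "book_appointment",
     "arguments": {"date": "2026-07-20", "duration_minutes": 30}},
    {"dialect": "Gulf", "name": "book_appointment",
     "arguments": {"date": "2026-07-20", "duration_minutes": 30}},
    {"dialect": "Gulf", "name": "none", "arguments": {}},
]
pred = [
    {"name": "book_appointment",
     "arguments": {"date": "2026-07-20", "duration_minutes": 30}},
    {"name": "book_appointment",  # right tool, wrong date format and type
     "arguments": {"date": "20/07/2026", "duration_minutes": "30"}},
    {"name": "book_appointment", "arguments": {}},  # hallucinated call
]
by_dialect = score(gold, pred)
```

In this toy run the second prediction picks the right tool but loses the argument match on formatting alone, which is the kind of failure the hard cases above describe.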
Cash prizes awarded to the best-performing systems on the final test set leaderboard. Open to every registered team worldwide — no fees, no nationality requirements.
The shared task runs in four clear phases. Registration opens first, then the data drops, then teams iterate on the live dev leaderboard, then the blind test settles the final ranking.
Registration is open now · sign up your team — name, leader, registered email, HF account. Locks you in to submit on the leaderboard from day one. Closes July 20.
Released June 1 · we drop the training + dev splits on Hugging Face, the official evaluation script, and the baseline model code. Pick any approach — fine-tuned LLMs, prompting, retrieval, hybrid — and get building.
June 1 → July 20 · iterate freely. Upload predictions to the live leaderboard, get auto-scored on the dev split, watch your rank update. Submit as many system variants as you want — the dev board is your sandbox.
July 20 → Aug 22 · the blind test set drops July 20. Run your best system, submit final predictions. Results published July 30. Camera-ready system description paper due August 22.
All participating system description papers must cite the three references listed in the Citation section, which provides ready-to-use BibTeX entries.
All deadlines are Anywhere on Earth (AoE).
Grouped the way you'll use them: build, read, get help.
Step 1: Train + dev splits are live. Blind test arrives July 20.
Step 2: AISA-Think, Gemma 3 (270M) + LoRA. 0.541 ArgEM is the bar.
Step 3: Upload dev predictions, get scored instantly, land on the live leaderboard.
A multi-disciplinary team building Arabic-native AI capabilities.
All system description papers must cite the shared task, the AISA architecture, and the methodology paper.
@inproceedings{najar2026aisaarabicfc,
  title = {{AISA-ArabicFC}: Arabic Function Calling for Agentic AI Systems},
  author = {Najar, Omar and Al Khalifa, Mohammed and Alzaharani, Saeed},
  booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
  year = {2026},
  address = {Budapest, Hungary},
  publisher = {Association for Computational Linguistics}
}

@misc{nacar2026aisa,
  title = {{AISA}: A Unified Architecture for Agentic AI Systems},
  author = {Nacar, Omer and Deema, A. and Mohammed, A.},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18161880},
  url = {https://doi.org/10.5281/zenodo.18161880}
}

@article{nacar2026language,
  title = {From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning},
  author = {Nacar, Omer and Alquffari, Deema and Alsharideh, Saleh and AlOtaibi, Adeem and Alabdulkarim, Abdulaziz and Alhazmi, Leen and Alomar, Nada and Alzubaidi, Wareef and Alsultan, Nada and Alrabghi, Ahmed and others},
  journal = {arXiv preprint arXiv:2603.16901},
  year = {2026}
}