Accepted Shared Task · ArabicNLP 2026
مهمة مشتركة لاستدعاء الدوال العربية في أنظمة الذكاء الاصطناعي التوكيلي

AISA-ArabicFC

Arabic Function Calling for Agentic AI Systems — the first shared task benchmarking tool-use in Arabic across five dialects, eight real-world domains, and 27 structured tools.

Oct 24–29, 2026 · Budapest, Hungary · Co-located with EMNLP 2026 · Open to teams worldwide · Arabic NLP Conference

Teach machines to act in Arabic.

علّم الآلات أن تتصرف بالعربية.

Function calling — the bridge between language models and the real world — has exploded in English. Models can book flights, query databases, and chain tools into autonomous agents. For Arabic, with its rich dialectal variation and morphological complexity, this capability barely exists.

AISA-ArabicFC closes that gap. Given an Arabic query in any of five dialects, your system must decide whether a tool call is needed, choose the right function from a candidate set, and extract structured arguments — optionally producing an Arabic reasoning trace.

A user in Cairo says "عايز أحجز دكتور" ("I want to book a doctor"). A user in Riyadh says "أبي أحجز موعد عند الدكتور" ("I want to book an appointment with the doctor"). Same intent. Same tool. Your model has to know.
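To make the three sub-decisions concrete, here is a minimal Python sketch of what a system's output could look like for the two queries above. The `ToolCall` record and its fields are hypothetical illustrations, and the no-call query is an invented example; only the tool name `book_doctor_appointment` comes from the task's healthcare domain. The official submission format may differ.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolCall:
    """Hypothetical prediction record: one decision per query."""
    query: str
    function: Optional[str]  # None means "no tool call needed"
    arguments: dict = field(default_factory=dict)

    @property
    def needs_call(self) -> bool:
        return self.function is not None

# Two dialects, one intent: both should resolve to the same tool.
egyptian = ToolCall("عايز أحجز دكتور", "book_doctor_appointment")
gulf = ToolCall("أبي أحجز موعد عند الدكتور", "book_doctor_appointment")
# A negative (no-call) case; this query text is invented for illustration.
chitchat = ToolCall("شكرا جزيلا", None)

print(egyptian.function == gulf.function)  # True
print(chitchat.needs_call)                 # False
```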

12,125 total samples · 5 Arabic dialects · 27 structured tools · 8 service domains

A unified reference architecture for agentic AI systems.

بنية مرجعية موحّدة لأنظمة الذكاء الاصطناعي التوكيلي.

A reference architecture proposed by Tuwaiq Academy — the foundation behind AISA-ArabicFC.

Design · Evaluation · Governance · Deployment
01 · CONCEPT

What is AISA?

AISA is an implementation-neutral reference architecture for building agentic AI systems. It formalizes how reasoning, execution, infrastructure, evaluation, and governance interact to produce reliable, scalable, and auditable agent behavior.

02 · STRUCTURE

Architecture Overview

AISA defines six interacting layers — intelligence, cognition, tooling, evaluation, deployment, and governance — that together describe how a complete agentic system is designed, run, and audited.

Cite: Nacar, O., Deema, A., & Mohammed, A. (2026). AISA: A Unified Architecture for Agentic AI Systems. Zenodo. https://doi.org/10.5281/zenodo.18161880

Pick your challenge.

اختر التحدي الذي يناسبك.

Submit to Track A, Track B, or both. Every submission is automatically scored on Track C for dialect robustness.

TRACK A · CORE

Function Call Detection & Selection

The flagship track. Given an Arabic query and a set of candidate tools, decide whether a call is needed, select the function, and extract its arguments.

  • Binary call decision
  • Function name selection
  • Argument extraction (JSON)
  • Open to all model architectures
TRACK B · REASONING

Reasoning-Augmented Calling

Everything in Track A, plus an Arabic reasoning trace inside <think> blocks before the tool call.

  • All Track A requirements
  • Arabic reasoning generation
  • Reasoning-action consistency scored
  • For interpretable agentic AI
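A minimal sketch of how a think-before-call check could work, assuming the model emits a `<think>…</think>` block followed by a JSON tool call in a single string. The output layout, the Arabic-character heuristic, and the function name `has_arabic_think_before_call` are assumptions for illustration; the official scorer may define this differently.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block

def has_arabic_think_before_call(output: str) -> bool:
    """True if an Arabic <think> block comes before the JSON call and no '{' precedes it."""
    match = THINK_RE.search(output)
    if match is None:
        return False
    before, after = output[:match.start()], output[match.end():]
    trace = match.group(1).strip()
    return "{" not in before and "{" in after and bool(ARABIC_RE.search(trace))

# Assumed output layout: reasoning first, then the call as JSON.
good = '<think>المستخدم يريد حجز موعد عند الطبيب</think>{"name": "book_doctor_appointment", "arguments": {}}'
bad = '{"name": "book_doctor_appointment", "arguments": {}}'

print(has_arabic_think_before_call(good))  # True
print(has_arabic_think_before_call(bad))   # False
```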
TRACK C · DIAGNOSTIC

Cross-Dialect Robustness

Not a separate submission — an automatic dialect-stratified breakdown of your Track A or B results.

  • MSA · Gulf · Egyptian
  • Levantine · Maghrebi
  • Per-dialect FnAcc + ArgEM
  • Dialect gap score (max−min)
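The dialect gap score listed above can be sketched as follows, assuming per-sample records carry a dialect label and a boolean correctness flag. The record layout and the toy numbers are invented for illustration and are not task results.

```python
from collections import defaultdict

def per_dialect_accuracy(records):
    """records: iterable of (dialect, is_correct) pairs -> {dialect: accuracy}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for dialect, is_correct in records:
        totals[dialect] += 1
        hits[dialect] += int(is_correct)
    return {d: hits[d] / totals[d] for d in totals}

def dialect_gap(scores):
    """Gap between the best and worst dialect (max - min)."""
    return max(scores.values()) - min(scores.values())

# Toy data only: 4 MSA samples (all correct), 4 Gulf samples (2 correct).
toy = [("MSA", True)] * 4 + [("Gulf", True)] * 2 + [("Gulf", False)] * 2
scores = per_dialect_accuracy(toy)
print(scores)               # {'MSA': 1.0, 'Gulf': 0.5}
print(dialect_gap(scores))  # 0.5
```

The same gap can be computed once per metric (FnAcc and ArgEM) by changing what counts as correct.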

12,125 Arabic queries. Real domains. Real dialects.

١٢٬١٢٥ استعلامًا عربيًا · مجالات حقيقية · لهجات حقيقية.

Hosted on Hugging Face at TuwaiqAcademy/AISA-ArabicFC. Training and development splits available now — blind test set released July 20, 2026.

| Split | Samples | Notes |
|---|---|---|
| Train | 10,550 | Available now |
| Dev | 525 | Available now |
| Test (blind) | 1,050 | Released July 20 |
| Positive samples | 12,000 | Tool call required |
| Negative samples | 125 | No-call cases |
| Reasoning traces | 12,000 | For Track B |

Load it in 3 lines

# pip install datasets
from datasets import load_dataset
ds = load_dataset("TuwaiqAcademy/AISA-ArabicFC")
print(ds["train"][0])

Dialect breakdown

| Dialect | % |
|---|---|
| Modern Standard Arabic (MSA) | 58.3% |
| Levantine | 16.9% |
| Egyptian | 12.2% |
| Gulf | 11.3% |
| Maghrebi | 1.3% |

Eight service domains

Real-world Arabic services. Tools designed by domain experts.

🏥 Healthcare
book_doctor_appointment · search_medications · check_insurance_coverage
🏦 Banking & Finance
transfer_money · convert_currency · calculate_customs
🏛️ Government Services
check_visa_status · check_iqama_status · check_traffic_violations
🕌 Islamic Services
get_qibla_direction · calculate_zakat · search_quran · calculate_inheritance
✈️ Travel
search_hotels · search_umrah_packages
🌤️ Weather & Environment
get_weather · get_air_quality
🛒 E-commerce
compare_prices · order_food
🔧 Utilities
translate_text · calculate_end_of_service

How submissions are scored.

كيفية تقييم المشاركات.

Each track is ranked on one weighted composite: two metrics for Track A, three for Track B. Argument extraction carries the most weight because it is the primary challenge. The official evaluation script will be released as open source.

Track A
Overall = 0.40 · FnAcc  +  0.60 · ArgEM
Track B
Overall = 0.30 · FnAcc  +  0.50 · ArgEM  +  0.20 · ThinkRate
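As a worked check of the two formulas, the sketch below recomputes the Overall scores from FnAcc, ArgEM, and ThinkRate. The function names are illustrative, not the official scorer's API; the inputs are the AISA-Think baseline numbers reported on this page (FnAcc 0.982, ArgEM 0.541, ThinkRate 0.868).

```python
def overall_track_a(fn_acc: float, arg_em: float) -> float:
    """Track A composite: 0.40 * FnAcc + 0.60 * ArgEM."""
    return 0.40 * fn_acc + 0.60 * arg_em

def overall_track_b(fn_acc: float, arg_em: float, think_rate: float) -> float:
    """Track B composite: 0.30 * FnAcc + 0.50 * ArgEM + 0.20 * ThinkRate."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate

# AISA-Think baseline numbers from this page.
print(round(overall_track_a(0.982, 0.541), 3))         # 0.717
print(round(overall_track_b(0.982, 0.541, 0.868), 3))  # 0.739
```

Because ArgEM carries the largest weight in both tracks, a gain of 0.10 in ArgEM moves the Track A score by 0.06, while the same gain in FnAcc moves it by 0.04.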
ArgEM · Argument Exact Match ★ (weight 0.60 in Track A, 0.50 in Track B)
Strict match of all predicted argument key-value pairs. The headline metric: the best small fine-tuned baseline reaches 0.541, while GPT-4o scores 0.070 zero-shot and 0.122 with 3-shot prompting, so there is substantial room to improve.
FnAcc · Function Name Accuracy (weight 0.40 in Track A, 0.30 in Track B)
Exact match of the predicted function name, or "none" for negatives, which also penalises hallucinated calls.
ThinkRate · Track B only (weight 0.20)
Did the system produce an Arabic reasoning trace before the tool call? Best baseline: 0.868.
Dialect gap · Track C (diagnostic)
Per-dialect FnAcc + ArgEM, plus the max−min gap. Reported for analysis, not ranking. Gulf and Levantine are the hardest.
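A minimal sketch of per-sample scoring under the FnAcc and ArgEM definitions above, assuming predictions and references are dictionaries with a `name` (or "none") and an `arguments` mapping. That layout, the helper names, the argument keys `amount`, `from`, and `to`, the choice to require a correct function name for an argument match, and the plain `==` comparison (so the string "500" does not match the number 500) are all assumptions. The official evaluation script is the authority on normalization.

```python
def fn_correct(pred: dict, gold: dict) -> bool:
    """Exact match on function name; negatives use the name "none"."""
    return pred.get("name", "none") == gold.get("name", "none")

def args_exact_match(pred: dict, gold: dict) -> bool:
    """Assumed rule: right function AND identical key-value pairs, no extras or omissions."""
    return fn_correct(pred, gold) and pred.get("arguments", {}) == gold.get("arguments", {})

# Hypothetical argument schema for a tool named in the banking domain.
gold = {"name": "convert_currency", "arguments": {"amount": 500, "from": "SAR", "to": "EGP"}}
typed_wrong = {"name": "convert_currency", "arguments": {"amount": "500", "from": "SAR", "to": "EGP"}}
hallucinated = {"name": "get_weather", "arguments": {}}
negative_gold = {"name": "none", "arguments": {}}

print(fn_correct(typed_wrong, gold), args_exact_match(typed_wrong, gold))  # True False
print(fn_correct(hallucinated, negative_gold))                             # False
```

The first case shows why strict matching is hard: the function is right, but a numeric value emitted as a string fails the comparison under this sketch.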

What the field looks like today.

واقع الأداء في المجال اليوم.

Four systems evaluated on the held-out test set. The takeaway: function selection is approachable. Argument exact match is wide open.

| System | Setup | FnAcc | ArgEM ★ | Overall (A) | Overall (B) |
|---|---|---|---|---|---|
| AISA-Think | Gemma 3 (270M) + LoRA · reasoning-augmented | 0.982 | 0.541 | 0.717 | 0.739 |
| GPT-4o | Zero-shot prompting | 0.927 | 0.070 | 0.413 | 0.313 |
| GPT-4o | 3-shot prompting | 0.854 | 0.122 | 0.415 | 0.317 |
| Random | Random tool from candidate set | 0.047 | 0.033 | 0.039 | 0.031 |

Track B Think-Before-Call rate: AISA-Think 0.868, others 0.000.
🎯 Argument extraction is the core challenge

Best ArgEM is 0.541; GPT-4o achieves only 0.070 zero-shot and 0.122 with 3-shot prompting. Hard cases include date format normalization, numeric type handling, and dialectal argument phrasing.

Small fine-tuned models beat GPT-4o

A 270M LoRA model outperforms GPT-4o across every metric — showing the value of task-specific Arabic training over scale alone.

🌍 Dialect gaps are significant

FnAcc varies by up to 17.8 percentage points across dialects. Gulf and Levantine Arabic are consistently the hardest, and Track C surfaces this directly.

$1,000 for the top three systems.

جوائز نقدية للأنظمة الثلاثة الأولى.

Cash prizes awarded to the best-performing systems on the final test set leaderboard. Open to every registered team worldwide — no fees, no nationality requirements.

🥇 1st Place · Winner: $500 cash prize
🥈 2nd Place: $300 cash prize
🥉 3rd Place: $200 cash prize
$1,000 total
Ranking is based on the Track A Overall score on the blind test set (released July 20). Winners announced July 30 and recognised at ArabicNLP 2026 in Budapest.

From registration to leaderboard to paper.

من التسجيل إلى لوحة الصدارة إلى الورقة العلمية.

The shared task runs in four clear phases. Registration opens first, then the data drops, then teams iterate on the live dev leaderboard, then the blind test settles the final ranking.

01

Register your team

Registration is open now · sign up your team with its name, team leader, registered email, and Hugging Face account. Registering lets you submit to the leaderboard from day one. Registration closes July 20.

02

Get data, baselines & eval

Released June 1 · we drop the training + dev splits on Hugging Face, the official evaluation script, and the baseline model code. Pick any approach — fine-tuned LLMs, prompting, retrieval, hybrid — and get building.

03

Climb the dev leaderboard

June 1 → July 20 · iterate freely. Upload predictions to the live leaderboard, get auto-scored on the dev split, watch your rank update. Submit as many system variants as you want — the dev board is your sandbox.

04

Test set & paper

July 20 → Aug 22 · the blind test set drops July 20. Run your best system, submit final predictions. Results published July 30. Camera-ready system description paper due August 22.

📝
Registration is open — sign up your team now
Open the official Tuwaiq Academy registration form: tuwaiq.edu.sa/form/rL7Bl3wq. Closes July 20.
Required citations for system description papers

All participating system description papers must cite the three references below. See the Citation section for ready-to-use BibTeX entries.

  • Shared task: Nacar et al. (2026) — AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems
  • Architecture: Nacar, Deema, & Mohammed (2026) — AISA: A Unified Architecture for Agentic AI Systems
  • Methodology: Nacar et al. (2026) — From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning

Important dates.

تواريخ مهمّة.

All deadlines are Anywhere on Earth (AoE, UTC−12).

May 16, 2026
Task launch
Shared task website live · registration opens · leaderboard online
June 1, 2026
Training / development data · baseline code · evaluation scripts released
Hugging Face dataset goes public · baseline model code published · official scorer released open-source
July 20, 2026
Registration deadline · Blind test data released
Last day to register your team on Hugging Face. Test set distributed to registered teams.
July 30, 2026
Final results released
Leaderboard published. Per-track and per-dialect breakdowns.
August 22, 2026
Camera-ready system description papers due
Participants submit final system papers describing their approach.
September 1, 2026
Shared task overview paper due
Organizers' overview paper covering task, methodology, and results.
September 10, 2026
Conference camera-ready deadline
Final paper revisions due for the proceedings.
October 24–29, 2026
ArabicNLP 2026 / EMNLP 2026 — Budapest, Hungary
Presentation at the Fourth Arabic Natural Language Processing Conference, co-located with EMNLP 2026.

Everything you need to compete.

كل ما تحتاجه للمشاركة في المنافسة.

Grouped the way you'll use them: build, read, get help.

Organized by Tuwaiq Academy.

بتنظيم من أكاديمية طويق.

A multi-disciplinary team building Arabic-native AI capabilities.

Omer Nacar · Tuwaiq Academy
Mohammed Al Khalifa · Tuwaiq Academy
Saeed Alzaharani · Tuwaiq Academy

Cite these three works.

يُرجى الاستشهاد بهذه الأعمال الثلاثة.

All system description papers must cite the shared task, the AISA architecture, and the methodology paper.

01 · Shared Task
AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems
Nacar, Al Khalifa & Alzaharani · ArabicNLP 2026 · Budapest
@inproceedings{najar2026aisaarabicfc,
  title     = {{AISA-ArabicFC}: Arabic Function Calling for Agentic AI Systems},
  author    = {Nacar, Omer and Al Khalifa, Mohammed and Alzaharani, Saeed},
  booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
  year      = {2026},
  address   = {Budapest, Hungary},
  publisher = {Association for Computational Linguistics}
}
02 · Architecture
AISA: A Unified Architecture for Agentic AI Systems
Nacar, Deema & Mohammed · Zenodo 2026 · 10.5281/zenodo.18161880
@misc{nacar2026aisa,
  title     = {{AISA}: A Unified Architecture for Agentic AI Systems},
  author    = {Nacar, Omer and Deema, A. and Mohammed, A.},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18161880},
  url       = {https://doi.org/10.5281/zenodo.18161880}
}
03 · Methodology
From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning
Nacar, Alquffari, Alsharideh, AlOtaibi et al. · arXiv:2603.16901 · 2026
@article{nacar2026language,
  title   = {From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning},
  author  = {Nacar, Omer and Alquffari, Deema and Alsharideh, Saleh and AlOtaibi, Adeem and Alabdulkarim, Abdulaziz and Alhazmi, Leen and Alomar, Nada and Alzubaidi, Wareef and Alsultan, Nada and Alrabghi, Ahmed and others},
  journal = {arXiv preprint arXiv:2603.16901},
  year    = {2026}
}