Alif: Technical Deep Dive
A single-user Arabic learning system built with spaced repetition, Arabic NLP, and LLMs
(This overview was generated by Claude. The code is here. If you end up experimenting with it, I'd love to hear from you.)
Overview
Alif is a sentence-first spaced repetition system for Arabic reading and listening comprehension, built for exactly one learner — me. It's a mobile app (Expo React Native) backed by a Python API (FastAPI + SQLite), running on a single Hetzner VPS for about €4/month. The whole thing was built using AI pair programming with Claude Code, starting from zero on February 8, 2026.
By the numbers: 70 backend Python files, 19 frontend screens, 27 lib modules, 2,060 tests, 610 git commits across 80+ Claude Code conversations. The development happened through about 750 user messages — mostly short directives steering Claude through long implementation sessions, with the longest single conversation running to 110+ messages. The codebase includes an extensive research folder with algorithm design documents, Arabic linguistics notes, experiment logs, and analysis reports.
This article covers everything: the algorithms, the Arabic NLP pipeline, the LLM integration, the frontend design, the data and experimentation, and the development story. It's long — grab a coffee.
Architecture & Stack
Backend
Framework: FastAPI (Python 3.11+)
Database: SQLite with WAL (Write-Ahead Logging) for concurrent reads
ORM: SQLAlchemy 2.0+ with Alembic migrations (run automatically at startup)
Deployment: Docker container on Hetzner VPS, uvicorn on port 8000
Background tasks: Python threading (no external message queue)
SQLite with WAL mode works surprisingly well for a single-user app. Every connection uses:
PRAGMA journal_mode=WAL
PRAGMA busy_timeout=15000
PRAGMA synchronous=NORMAL
PRAGMA foreign_keys=ON
Frontend
Framework: Expo React Native (New Architecture enabled), React 19
Navigation: Expo Router 6 (file-based routing)
Arabic typography: Scheherazade New (serif, excellent diacritical support)
English typography: Noto Sans
Theme: Dark theme with semantic color coding (green=known, orange=learning, blue=acquiring, red=missed, yellow=confused)
LLM Strategy (Two-Tier)
Background processing: Claude CLI via Max subscription (free) — Sonnet for sentence generation, Haiku for quality gate + enrichment + mapping verification
User-facing: Gemini Flash (fast, ~1s) for on-demand tasks, story generation
Fallback chain: Gemini Flash → GPT-5.2 → Claude Haiku API
Podcast/Story TTS: ElevenLabs with PVC voice clone, 3-voice pool for story rotation
The Learning Algorithm
Word Lifecycle
Every Arabic word passes through a defined lifecycle:
new (imported into system)
↓
encountered (seen passively in sentences, not yet studied)
↓
acquiring (active study in Leitner boxes)
↓
learning (graduated to FSRS, stability < 1 day)
↓
known (FSRS Review state, stability ≥ 1 day)
↓
lapsed (failed reviews → relearning)
↓ ↑
[recovers back to known]
suspended (leech: ≥5 reviews, <50% accuracy)
↓ [cooldown: 3/7/14 days]
acquiring (reintroduced)
Words enter the system from multiple sources: Duolingo import (95 words initially), OCR textbook scans (926 words — the largest single source), book imports (315), auto-introduction based on frequency (111), Quran verse lemmatization (66), leech reintroductions (49), collateral credit (49), manual selection in Learn mode (46), mapping corrections (22), story imports (21), and flag-triggered auto-creation (9).
Phase 1: Acquisition (Leitner Boxes)
The core insight: spaced repetition (FSRS, Anki, etc.) is designed for remembering, not learning. FSRS has no native learning phase — it assumes the card is already known. Anki's learning steps (1m/10m) happen outside the FSRS algorithm entirely.
Alif handles first-encounter learning with a 3-box Leitner system:
Box 1 — 4 hours (encoding, can advance within same session)
Box 2 — 1 day (sleep consolidation, enforces inter-session spacing)
Box 3 — 3 days (long-term consolidation, must meet graduation criteria)
Advancement rules:
Rating ≥ 3 (good/easy): Advance box (respecting due-date gating for box 2+)
Rating 2 (hard): Stay in same box, reset interval
Rating 1 (again): Reset to box 1, retry in 5-10 minutes
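The advancement rules reduce to a small pure function. This is a sketch with assumed constants from the table above; the real system additionally enforces due-date gating for boxes 2+, which is omitted here:

```python
from datetime import datetime, timedelta

# Interval per Leitner box, per the table above (assumed constants).
BOX_INTERVALS = {1: timedelta(hours=4), 2: timedelta(days=1), 3: timedelta(days=3)}

def advance_leitner(box: int, rating: int, now: datetime) -> tuple[int, datetime]:
    """Return (new_box, next_due) for a rating on the app's 1-4 scale."""
    if rating >= 3:                      # good/easy: move up, capped at box 3
        box = min(box + 1, 3)
    elif rating == 2:                    # hard: stay in the box, restart interval
        pass
    else:                                # again: back to box 1, quick retry
        return 1, now + timedelta(minutes=5)
    return box, now + BOX_INTERVALS[box]
```

A word rated "good" in box 1 comes back in a day (box 2); a lapse anywhere sends it back to box 1 within minutes.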
Graduation is tiered, added after discovering ~1,465 excess reviews on words with ≥95% accuracy:
Tier 0: First review correct → instant graduation (words already known from other sources)
Tier 1: 100% accuracy + 3+ reviews → any box (fast learners)
Tier 2: ≥80% accuracy + 4+ reviews + box ≥ 2 (solid learners)
Tier 3: Box ≥ 3 + 5+ reviews + ≥60% accuracy + 2+ calendar days (standard path)
A batch graduation script immediately graduated 41 stuck perfect-accuracy words when this was deployed.
Phase 2: Long-Term Maintenance (FSRS-6)
After graduating from Leitner, words enter FSRS-6 (the modern spaced repetition algorithm used by Anki 23.10+). Each word gets a card with:
Stability: the number of days for recall probability to fall from 100% to 90%
Difficulty: 1-10 scale (inverse of easiness)
State: Learning → Review → Relearning
FSRS handles all long-term scheduling. Retention target is ~90% (FSRS-6 default).
Root-Aware Stability Boost
On graduation from acquisition to FSRS, the system counts how many known words share the same root. If ≥2 siblings are already known (e.g., you know كِتَاب kitāb "book" and كَاتِب kātib "writer", and now you're graduating مَكْتَب maktab "office"), the first FSRS review is rated "easy" instead of "good" — resulting in longer initial intervals.
This leverages Arabic's root-family structure: knowing siblings means the root is already anchored in memory.
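A minimal sketch of the sibling check, assuming a simple lemma-to-root mapping for known words (illustrative, not the project's schema):

```python
def graduation_rating(word_root: str, known_words: dict[str, str]) -> str:
    """known_words maps known lemma -> root. Two or more known siblings
    sharing the graduating word's root upgrade the first FSRS rating."""
    siblings = sum(1 for root in known_words.values() if root == word_root)
    return "easy" if siblings >= 2 else "good"
```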
Leech Detection & Recovery
Detection: sliding window of last 8 reviews, accuracy < 50% (switched from cumulative accuracy after discovering that cumulative tracking made recovery mathematically near-impossible)
Auto-suspension with graduated cooldowns:
1st leech: 3-day cooldown
2nd leech: 7-day cooldown
3rd+ leech: 14-day cooldown
On reintroduction: Stats are partially reset (times_seen = max(3, times_seen // 2)) — this was critical because the original implementation had a 0% recovery rate. The math made it essentially impossible: a word seen 10 times with 3 correct (30% accuracy) would need 4 consecutive correct reviews just to reach 50% — one failure and it's re-leeched. The partial reset gives words a fighting chance.
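The recovery arithmetic can be verified directly. Both helpers are illustrative sketches of the math described above:

```python
def reviews_needed_for_accuracy(seen: int, correct: int, target: float = 0.5) -> int:
    """Consecutive correct reviews needed before cumulative accuracy reaches target."""
    k = 0
    while (correct + k) / (seen + k) < target:
        k += 1
    return k

def partial_reset(times_seen: int) -> int:
    """The reintroduction reset from the text: times_seen = max(3, times_seen // 2)."""
    return max(3, times_seen // 2)
```

A word at 10 seen / 3 correct needs four straight correct answers to touch 50%, while the partial reset halves the denominator it has to climb against.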
When a word first fails, the system generates premium memory hooks: 3 candidate mnemonics with different keyword associations, self-evaluated, best one selected.
Intro Cards (A/B Experiment Concluded)
Every new word now gets a rich info card before its first sentence review. This was validated by a five-week A/B experiment across 264 words (142 card-first vs 122 sentence-first):
First-review accuracy: 65% vs 37% (+28pp)
Graduation rate: 73% vs 57%
Median time to graduate: 11h vs 26h (2.4x faster)
Median reviews to graduate: 5 vs 7
Post-graduation FSRS accuracy: 95% vs 96% (similar)
Rescue cards: Acquiring words with ≥4 reviews and <50% accuracy get a re-teaching intro card (7-day cooldown between rescue cards). These target words that are stuck in the acquisition pipeline.
Auto-Introduction (Adaptive)
The system reserves 20% of session slots for new words, even when the due queue exceeds the session limit. The rate adapts to recent accuracy:
< 70% accuracy → 0 new words (struggling — pause introductions)
70-85% → 3 new words
≥ 85% → up to 5 new words
A pipeline backlog gate suppresses introductions when too many words are in the acquiring pipeline, but the threshold is dynamic — it scales with 2-day accuracy: ≥90% accuracy → 80 words allowed, ≥80% → 60, <80% → 40. This prevents high-performing learners from being starved of new material.
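The two gates compose into a single budget function. A sketch with the thresholds stated above; function and parameter names are assumptions:

```python
def new_word_budget(recent_accuracy: float, acquiring_backlog: int,
                    two_day_accuracy: float) -> int:
    """How many new words this session may introduce (0 means paused)."""
    # Dynamic backlog ceiling, scaled by 2-day accuracy.
    if two_day_accuracy >= 0.90:
        cap = 80
    elif two_day_accuracy >= 0.80:
        cap = 60
    else:
        cap = 40
    if acquiring_backlog >= cap:
        return 0                      # pipeline too full: suppress introductions
    # Accuracy-gated introduction rate.
    if recent_accuracy < 0.70:
        return 0
    if recent_accuracy < 0.85:
        return 3
    return 5
```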
Session Assembly
This is where Alif's sentence-first approach gets technically interesting. The system needs to select ~10-20 sentences per session that maximize coverage of due words while maintaining comprehensibility.
The Set-Cover Algorithm
Phase 1: Identify due words. Collect all words in states (acquiring, learning, known, lapsed) whose review is due.
Phase 2: Auto-introduction. If the session would be undersized, add new words (rate determined by accuracy gates above).
Phase 3: Score candidate sentences. For each sentence containing at least one due word:
score = due_coverage^1.5
× difficulty_match [0.3 - 1.0]
× grammar_fit [0.8 - 1.1]
× diversity [1/(1 + times_shown)]
× scaffold_freshness [0.1 - 1.0]
× source_bonus [1.0 or 1.3 for books]
× session_diversity [0.5^reuse_count]
× rescue_penalty [0.3 if recently shown]
Key scoring components:
Due coverage (exponent 1.5) — sentences covering more due words are superlinearly preferred
Difficulty match — do the non-target "scaffold" words match the weakest due word's stability? (Avoids overwhelming sentences)
Scaffold freshness — geometric mean penalty for over-reviewed scaffold words, prevents the same familiar sentences from being recycled
Session diversity — penalizes reusing the same scaffold words within a session (decay = 0.5 per reuse)
Source bonus — book/OCR sentences get 1.3x (real text preferred over generated)
Comprehensibility gate: Reject any sentence where less than 60% of scaffold words are known. "Known" here means state ∈ {known, learning, lapsed} or acquiring with stability ≥ 0.5. Encountered words and function words don't count.
Phase 4: Greedy cover loop. Pick the highest-scoring sentence, add it to the session, remove its due words from the remaining set, rescore all candidates (accounting for session diversity), repeat until the session is full or no more due words remain.
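The greedy loop can be sketched independently of the scoring details. Here score_fn stands in for the composite score above; this is an illustration, not the project's code:

```python
def greedy_cover(sentences: dict[int, set[str]], due: set[str],
                 score_fn, limit: int = 15) -> list[int]:
    """Greedy set-cover: repeatedly take the best-scoring sentence,
    remove the due words it covers, and rescore the remainder."""
    session: list[int] = []
    remaining = set(due)
    while remaining and len(session) < limit:
        # Candidates are unused sentences that still cover something due.
        candidates = {
            sid: words & remaining
            for sid, words in sentences.items()
            if sid not in session and words & remaining
        }
        if not candidates:
            break
        best = max(candidates, key=lambda sid: score_fn(sid, candidates[sid]))
        session.append(best)
        remaining -= candidates[best]
    return session
```

With a coverage-only score (len(covered) ** 1.5), the loop first grabs the sentence covering the most due words, then mops up the rest.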
Recency cutoffs prevent showing the same sentence too soon:
Understood: 1-day window
Partial: 4-hour window
No idea: 30-minute window
Sessions build entirely from pre-generated sentences — no LLM calls in the critical path (<1s build time). A background warm_sentence_cache() runs after every session load to generate material for upcoming reviews, and a cron job runs every 3 hours to backfill gaps.
Within-Session Word Repetition
Acquisition-phase words appear multiple times within a session at expanding positions (N, N+3, N+7), based on research showing expanding spacing (lag 0-1-5-9) is effective for initial acquisition.
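A sketch of the expanding-position placement, assuming repetitions that would fall past the session end are simply dropped:

```python
def repetition_positions(first: int, session_len: int,
                         offsets: tuple[int, ...] = (0, 3, 7)) -> list[int]:
    """Positions at which an acquiring word reappears within a session
    (N, N+3, N+7), clipped to the session length."""
    return [first + o for o in offsets if first + o < session_len]
```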
Arabic NLP Pipeline
CAMeL Tools
The backbone of Arabic morphological analysis. CAMeL Tools (from NYU Abu Dhabi) provides:
Morphological analysis: for each word, returns all possible analyses (lemma, root, POS, gender, number, state, enclitics)
MLE disambiguation: when multiple analyses exist, picks the most likely based on a pretrained model. Loaded lazily as a singleton; falls back to a stub if unavailable.
Example output for a single word:
{
"lex": "كِتَاب", # base lemma with diacritics
"root": "ك.ت.ب", # trilateral root
"pos": "noun",
"enc0": "", # enclitic (pronoun suffix)
"num": "singular",
"gen": "masculine",
"stt": "d", # state: determined
}
Clitic Stripping
Arabic attaches prepositions, conjunctions, and possessive pronouns directly to words. وَكِتَابُكَ (wa-kitābu-ka) is a single written word meaning "and your book" — three morphemes fused together.
The system handles both proclitics (prefixes) and enclitics (suffixes):
Proclitics: ال (the), بال (with the), لل (for the), و (and), ف (so/then)
Enclitics: -ي (my), -ك (your), -ه (his), -ها (her), -هم (their), etc.
CAMeL Tools identifies these, and the system strips them to find the core lemma. This means كِتَاب only needs to be learned once, regardless of what's attached to it.
Verb Conjugation Recognition
A critical bug at week 3: 82% of generated sentences were being rejected by the validator. Root cause: the comprehensibility gate didn't recognize conjugated verb forms as "known words." The word يَكْتُبُ (yaktub, "he writes") wasn't being matched to the lemma كَتَبَ (katab, "wrote").
Fix: algorithmic generation of ~33 conjugation forms per verb (present tense across all persons/genders/numbers, past tense forms, masdar, participles, imperative). These are generated by the LLM during word enrichment and stored in forms_json. The validator now checks surface forms against all known forms.
Lemmatization Feedback Loop
The system operates at four layers:
**Generation-time correction**: `verify_and_correct_mappings_llm()` catches wrong mappings before storage. If the correct lemma exists in the DB, the mapping is fixed. If not, the sentence is rejected — the system never auto-creates lemmas from corrections (this was a hard-won principle after orphan lemmas bypassed quality gates and ended up as review targets).
**Background batch verification**: After every session, `warm_sentence_cache` Phase 4 checks existing sentences via batched LLM calls. Unfixable sentences are retired.
**User flag resolution**: When I flag a wrong mapping, it's fixed if possible, propagated to other sentences with the same error (LLM-verified, max 50), or the sentence is retired.
**Disambiguation**: `disambiguate_mappings_llm()` resolves ambiguous tokens using sentence context at generation time.
Homograph-aware correction: A subtle bug went undetected for weeks — correct_mapping() used .first() to find lemmas by bare form. When two lemmas share the same consonantal skeleton (سلم "peace" vs سلم "ladder"), it found the same wrong lemma and silently concluded "already correct." Now it searches all matches and picks a different one.
Multi-hop variant chains: Variant lemmas (الكلب → كلب → canonical form) form chains that must be followed to the root in every code path: review credit, session building, story knowledge maps, word introduction priority. A single-hop implementation caused a bug where غرفة appeared as a "new word" despite 37 reviews of its canonical — because the intermediate variant was itself a variant.
Function Words
About 100+ words (prepositions, pronouns, conjunctions, particles) are classified as function words. They appear freely in sentences as scaffold but are excluded from the learning pipeline — you pick them up naturally through sentence exposure rather than drilling them explicitly.
LLM Integration
Sentence Generation
The primary content creation pipeline. Sentences are generated by LLM with strict vocabulary constraints:
**System prompt** specifies style rules, naturalness requirements, and MSA standards
**Vocabulary constraint**: Only use provided known words + target word(s) + function words
**Validation**: CAMeL Tools analysis checks every word against the vocabulary constraint
**Retry loop**: Up to 7 retries with feedback on which words violated the constraint
**Rejected words** are added to an `avoid_words` list for subsequent retries
Batch generation (3+ sentences per LLM call) is more efficient and produces more diverse output. Multi-target generation creates sentences containing ≥2 target words for cohesive material.
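The retry loop with an accumulating avoid_words list can be sketched as follows, where generate and validate are stand-ins for the LLM call and the CAMeL-based vocabulary check:

```python
def generate_with_retries(generate, validate, max_retries: int = 7):
    """Feedback-driven retry loop (sketch).

    generate(avoid) -> candidate sentence, steering away from `avoid` words.
    validate(sentence) -> list of out-of-vocabulary words (empty means valid).
    """
    avoid: set[str] = set()
    for _ in range(max_retries):
        sentence = generate(avoid)
        violations = validate(sentence)
        if not violations:
            return sentence
        avoid |= set(violations)   # feed violations back into the next attempt
    return None                    # give up after max_retries
```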
Story Generation
Longer-form content using only known vocabulary — pure fluency practice. Stories now come in 4 formats:
Standard: 6-10 sentence narrative with full diacritics
Long: 12-20 sentences for more sustained reading
Breakdown: sentences designed to be splittable — audio plays half-sentence, then full, then full story
Arabic explanation: simple Arabic explanations of each sentence — no English at all (A1-level comprehensible input in Arabic)
Story audio uses ElevenLabs TTS with a pool of 3 male voices, deterministically rotated by story_id % pool_size. An archive system lets me mark stories as "done but keep for re-listening" separately from completion status. times_heard tracks passive listening without triggering FSRS reviews. A cron step auto-generates stories to keep ≥3 active non-archived ones available.
A benchmark of 32 stories across 4 models and 4 strategies found:
Gemini Flash — quality 3.0, naturalness 2.8, $0.001/story, 6s
GPT-5.2 — quality 3.0, naturalness 1.6, $0.018/story, 19s
Claude Opus — quality 4.0, naturalness 3.2, $0.155/story, 25s
Claude Sonnet — quality 4.0, naturalness 3.8, $0.027/story, 19s
Best single story: "The Old Message" (Sonnet + two-pass strategy) — emotional, coherent, sophisticated vocabulary, perfect diacritics. The story system prompt emphasizes craft: humor, suspense, twists, poetry, or warmth; concrete details over abstractions; proper names used sparingly; "the last sentence matters most — land the ending."
Memory Hook Generation
When a word first fails (or is detected as a leech), the system generates mnemonic keyword associations:
Standard (first failure): Single keyword mnemonic with collocations, cognates, usage context.
Premium (leech): Three candidate mnemonics with different keywords, self-evaluated on sound match, interaction quality, and meaning extractability. Best candidate selected.
Example output for كِتَاب (kitāb, "book"):
{
"mnemonic": "you see a CAT open a TAB... start reading a BOOK",
"cognates": [{"lang": "Hindi", "word": "किताब (kitab)", "note": "direct borrowing"}],
"collocations": [{"ar": "كِتَابُ مِفْتَاحٍ", "en": "key book"}],
"usage_context": "In headlines: كتاب جديد (new book) released",
"fun_fact": "Arabic kitāb is borrowed into Indonesian, Hindi, Turkish, Swahili..."
}
Word & Root Enrichment
Each word gets enriched with:
Etymology — root meaning, pattern, derivation chain, semantic field, related loanwords, cultural notes
Forms — all conjugation/inflection forms with transliteration
Root enrichment — etymology story, cultural significance, literary examples, fun facts, related roots
Pattern info — what the morphological pattern (wazn) means, how to recognize it, semantic fields, example derivations
Confusion Analysis
When a user marks a word "confused" (yellow — knows it but didn't recognize it), the system runs rule-based analysis (<50ms, no LLM calls):
Prefix disambiguation — is the first/last letter a prefix/suffix or part of the root?
Visually similar words — character-level diff highlighting (which positions differ)
Phonetically similar words — words that sound alike
Classification: "clitic_heavy", "conjugation_complex", "visually_similar".
The Sentence Generation Pipeline
This is where Alif gets technically interesting at scale. Each Arabic sentence passes through eight stages — from word selection through LLM generation, deterministic validation, diversity checking, quality review, mapping verification, storage, and ongoing lifecycle management. A sentence that fails any gate is retried up to 7 times with feedback-driven correction.
The Full Pipeline
Word Selection → LLM Generation → 3-Pass Validation → Diversity Check → Quality Gate → Mapping Verify → Storage → Lifecycle
(tier priority) (Gemini/Claude) (deterministic) (scaffold reuse) (Claude Haiku) (LLM cross-check) (tier rotation)
Design Principle: Generate-Then-Write
All generation functions avoid holding SQLite write locks during slow LLM calls (15-30s). The pattern: (1) read DB and commit/close → (2) call LLM with no DB lock → (3) reopen DB briefly for writes. This prevents "database is locked" cascades during concurrent access.
This was formalized into a strict 4-phase discipline after a cascade failure where three services (story import, OCR, lemma enrichment) all held write locks during LLM calls simultaneously — crashing chat and making session-end take 30-60 seconds. The fix restructured every service into: Phase 1 (read + commit), Phase 2 (LLM calls, no DB), Phase 3 (batch write, minimal lock time), Phase 4 (post-write LLM for variant detection/verification, with fresh lemma IDs). Lock hold time went from 15-90 seconds to <100ms.
Three-Pass Lemma Lookup
The validation is fully deterministic — no LLM involved. A sentence is valid iff the target word is found AND there are zero unknown words.
build_lemma_lookup() constructs a dictionary mapping normalized Arabic forms to lemma IDs in three passes:
**Direct bare forms**: Register each lemma's normalized bare form (diacritics stripped, alef normalized), plus ال-prefix variants.
**Forms from forms_json**: Index ALL string-valued keys from the lemma's enrichment data — past tenses, imperative, passive participle, sound plurals, everything.
**Algorithmic conjugation generation**: For each verb, generate ~33 conjugation forms by combining 4 present-tense prefixes × 6 suffix variants, plus 9 past-tense suffixes. Weak verb support uses `past_1s` from forms_json for irregular stems (قال→قلت, مشى→مشيت). For nouns: sound plurals (ـات/ـون/ـين) and dual forms.
Clitic stripping tests all combinations of proclitics (وال، بال، فال، لل، و، ف، ب، ل) and enclitics (هما، هم، هن، ها، كم، نا، ني، ه، ك), with taa marbuta restoration (مدرسته → مدرسة + ه). Clitic-derived mappings are tagged via_clitic=True for extra scrutiny during verification.
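A sketch of the present-tense expansion, using the standard four subject prefixes and an illustrative subset of the suffix variants (the real generator also handles past tense, duals, and weak-verb stems):

```python
# Standard imperfect subject prefixes (I / you / he-she / we).
PRESENT_PREFIXES = ["أ", "ت", "ي", "ن"]
# Illustrative suffix variants; the real system combines six.
PRESENT_SUFFIXES = ["", "ين", "ان", "ون", "ن"]

def present_forms(stem: str) -> set[str]:
    """All prefix x suffix combinations around a present-tense stem."""
    return {p + stem + s for p in PRESENT_PREFIXES for s in PRESENT_SUFFIXES}
```

For the stem كتب this yields forms like يكتب ("he writes") and تكتبون ("you [pl.] write"), all of which then resolve back to the lemma كَتَبَ in the lookup.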
Mapping Verification & Correction
After deterministic validation, an LLM cross-check verifies that each word was mapped to the correct lemma. This catches the cases that pure morphological analysis gets wrong — homograph collisions (ذَهَب "gold" vs ذَهَبَ "to go"), clitic over-stripping (وَالأَدَبِ → دَبَّ "creep" instead of أَدَب "literature").
The key design decision: verify-then-correct instead of verify-then-discard. When the verifier finds a wrong mapping, it suggests the correct lemma. correct_mapping() fixes it in the database if the correct lemma exists. If not, the sentence is rejected — never patched with an auto-created lemma.
Previously, three different scripts had their own copy of the generation code that skipped this verification entirely. Analysis showed 20.4% of active sentences (136/667) were unverified. Unifying all generation through a single verified pipeline (generate_material_for_word()) and removing 467 lines of duplicated unverified code was one of the most impactful changes in the project's history.
System Prompt Engineering
The Arabic style rules are extensive (and were refined iteratively through real usage):
Mix VSO and SVO word order — VSO is more formal/classical, SVO more contemporary
No copula: never insert هُوَ as "is" with indefinite predicates (مُحَمَّدٌ طَبِيبٌ, NOT مُحَمَّدٌ هُوَ طَبِيبٌ)
Never start a nominal sentence with an indefinite noun — use definite or verb-first
Correct i'rab (case endings) with tanween on indefinites
No redundant pronouns — verb conjugation already encodes the subject
Semantic coherence in compound sentences (clauses joined by و/ثُمَّ/لَكِنَّ must be logically related)
The difficulty guide scales grammar complexity: beginners get SVO, short nominal sentences, basic connectors; intermediate gets relative clauses, negation with لَمْ/لَنْ, idafa chains; advanced gets VSO default, embedded clauses, classical particles.
A single change — expanding the known vocabulary sample from 50 to 500 words — improved validation compliance by 31 percentage points.
Tier-Based Lifecycle
Sentences have a lifecycle tied to when their target word is due for review:
Tier 1 (≤12h) — 3 target sentences, floor 2
Tier 2 (12-36h) — 2 target sentences, floor 1
Tier 3 (36-72h) — 1 target sentence, floor 0
Tier 4 (72h+) — 0, retired
This bounds the active sentence pool by review urgency (~200 tier 1-3 words), not by total vocabulary size. A cron job runs every 3 hours to backfill, and warm_sentence_cache() fills gaps after every session.
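The tier table maps directly onto a lookup function (a sketch; names are illustrative):

```python
def sentence_tier(hours_until_due: float) -> tuple[int, int, int]:
    """Return (tier, target_sentence_count, floor) by review urgency."""
    if hours_until_due <= 12:
        return 1, 3, 2
    if hours_until_due <= 36:
        return 2, 2, 1
    if hours_until_due <= 72:
        return 3, 1, 0
    return 4, 0, 0      # not due soon: sentences retired
```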
Content Pipeline
OCR Import
The bridge between physical textbooks and the app:
**Cover extraction**: Gemini Vision → title, author metadata
**Page OCR** (parallel per page): Gemini Vision → raw Arabic text
**LLM cleanup**: Diacritization, sentence splitting, cleaning
**Translation**: Sentence-by-sentence English translation
**Morphology**: CAMeL Tools → root, POS, lemma mapping per word
**Story creation**: Aggregate into story record with word-level tracking
Dark images are auto-enhanced (brightness/contrast) with retry if initial OCR yields <100 characters.
Reading Goals & Pre-Reading Clusters
Import any text → the app analyzes vocabulary coverage → auto-schedules missing words as priority → daily practice fills the gaps → come back and read.
Implementation: Stories track readiness_pct (percentage of vocabulary known, live-recalculated with multi-hop variant chain resolution). Each word has is_known_at_creation tracking whether it was known at import time. The dashboard shows active stories with completion percentage and remaining unknown words. A story that showed 23% readiness jumped to 99% after fixing the variant resolution — the remaining unknown word became the #1 introduction candidate.
Cold vs. warm unknowns: Not all unknown words are equally unknown. A "warm" unknown has ≥1 known root sibling in the learner's DB — e.g., you know كَتَبَ (katab, "to write") and encounter مَكْتَبَة (maktaba, "library") for the first time. Root-family knowledge provides ~50-70% semantic access (Boudelaa & Marslen-Wilson 2013). A "cold" unknown has no known root siblings — genuinely novel.
The reading readiness score accounts for this:
readiness_pct = (known_count + 0.6 × warm_unknowns) / total_words × 100
The 0.6 coefficient reflects partial semantic access via root-family priming. The story detail screen shows a banner: "87% ready · 3 new · 2 familiar root."
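The formula as a function, for concreteness:

```python
def readiness_pct(known: int, warm_unknowns: int, total: int) -> float:
    """Reading readiness with 0.6 partial credit for warm unknowns."""
    if total == 0:
        return 0.0
    return (known + 0.6 * warm_unknowns) / total * 100
```

A 100-word story with 80 known words and 10 warm unknowns scores 86% ready, versus 80% if warm unknowns earned nothing.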
Pretesting: A "Preview" button on the story screen shows the top 5 cold unknowns (sorted by token frequency within that story). Each word appears in Arabic for 2 seconds — a failed-retrieval attempt that research shows improves subsequent learning by 19 percentage points (Richland, Kornell & Kao 2009). After the preview: "Watch for them as you read."
Quran Reading Mode
A late addition that turned into one of the most technically interesting subsystems. The entire Quran (6,236 verses) is stored locally, and verses are scheduled for spaced reading practice alongside regular sentence reviews.
Verse SRS (Separate from FSRS)
Quran verses use a simpler 8-level SRS system rather than FSRS — appropriate because the goal is reading fluency (recognizing and understanding verses), not vocabulary acquisition (which happens through the regular pipeline):
Level 0 — unseen
Level 1→2 — 4 hours (first encounter)
Level 2→3 — 12 hours (same-day review)
Level 3→4 — 1 day (next-day consolidation)
Level 4→5 — 3 days (short-term retention)
Level 5→6 — 7 days (weekly retention)
Level 6→7 — 21 days (monthly retention)
Level 7→8 — graduated
A backlog gate prevents introduction of new verses when >20 are in learning states, and new verse introductions are capped at 3 per day — keeps the review load manageable.
Interleaving with Sentence Reviews
Verse cards are interleaved into regular review sessions at ~8-card intervals. The frontend renders them distinctly: large Uthmani script (36pt front / 30pt back), with the flip side showing English translation, ALA-LC transliteration, and pills for any words the learner tapped during reading. Rating options: "not yet" (reset to level 1), "partially" (drop one level, retry in 2h), "got it" (advance with SRS interval).
The Quran Lemmatizer
Quranic Arabic uses Uthmani script, which differs from modern MSA in several ways that break standard lemmatization:
Hamzat al-wasl restoration: When preceded by a proclitic, initial alef disappears. بِسْمِ ("in the name of") is بِ + اسْم — but after stripping the proclitic, you get سم, which doesn't match any lemma. The lemmatizer tries prepending alef (اسم) and checking the DB.
Ta maftouha → ta marbuta fallback: Quranic script writes some feminine endings as ت (open ta) where modern Arabic uses ة (closed ta). رَحْمَت vs رَحْمَة — the lemmatizer tries the ة variant when the ت form doesn't resolve.
Uthmani-specific diacritics: Characters like U+06E1 (Uthmani sukun), U+06DF (rounded zero), U+06E2 (small high meem) need special handling in both transliteration and diacritic stripping.
Creation pipeline for unknown Quranic words: (1) Tokenize verse, resolve against existing lemma DB. (2) Batch-translate unknowns via LLM, with a prompt emphasizing general Arabic meanings — "merciful, compassionate" not "Most Merciful." (3) Create new Lemma records with source="quran", link/create Root entries, set knowledge state to "encountered." (4) Trigger enrichment (forms, etymology, transliteration) in background.
All words in verse cards are tappable — content words show the full WordInfoCard (root, pattern, gloss, etymology), function words show a gloss-only card. A "Got it" button on the card front lets you skip the translation entirely when you already understand the verse — the same progressive scaffolding philosophy as tashkeel fading.
Podcast Generation
The podcast system generates personalized Arabic listening practice episodes — audio designed for walks, commutes, and other hands-free time. The inspiration is Michel Thomas, whose entire teaching method is audio-based: the teacher manages the memory, the student's only job is to listen and engage. Pimsleur's Graduated Interval Recall is the only audio method with formal spaced repetition built in.
Why Podcasts Need Different Rules
A key finding from the research: listening comprehension requires 95-98% known word coverage, versus ~60% for reading with visual support. You can't tap an unknown word in audio to look it up. So the podcast system queries FSRS state and only includes sentences where virtually every word is already known or in late acquisition.
Format Library
Six formats, each a function that appends typed Seg objects (Arabic, English, or silence) to a segment list:
**Sentence Drill**: Arabic → English → Arabic for each sentence, then a recall pass (English → pause → Arabic) where you try to produce the Arabic before hearing it
**Story Breakdown**: Learn a short story sentence by sentence, replaying growing sequences, then the full story uninterrupted
**Comprehensible Input**: Mostly Arabic — brief English gloss for the target word, then the full sentence twice
**Root Explorer**: Walk through a root family (e.g., K-T-B: كِتَاب, كَاتِب, مَكْتَبَة) with sentences for each word — uniquely suited to Arabic morphology
**Word Spotlight**: Focus on currently-acquiring words with rich context
**Story Retelling**: Full story from the story library, read at natural pace
Technical Implementation
podcast_service.py generates segments, caches each one by content hash (SHA256 of text + language + speed + slow_mode), and stitches them into MP3 using pydub/ffmpeg. Arabic TTS runs at 0.75x speed with learner pauses (commas inserted every 2 words in slow mode); English at normal speed. Voice is the same PVC clone used for story narration.
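The content-addressed cache key can be sketched like this; the exact field serialization is an assumption, only the hashed inputs are stated in the text:

```python
import hashlib

def segment_cache_key(text: str, language: str, speed: float, slow_mode: bool) -> str:
    """SHA256 over (text, language, speed, slow_mode): identical segments
    across episodes reuse the same cached TTS audio."""
    payload = f"{text}|{language}|{speed}|{slow_mode}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because the key depends only on content, a recurring phrase like "listen and repeat" is synthesized once, no matter how many episodes include it.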
The first sampler episode: 15.6 minutes, 159 TTS calls (~9,700 characters), generated in 3.5 minutes. Frontend has a player with play/pause/seek and background audio support.
Auto-Generation & Completion Tracking
A cron job maintains ≥4 unheard podcasts (max 2 per run to control ElevenLabs TTS cost), alternating between story-based and comprehensible input formats. A story-to-podcast pipeline extracts sentences from DB stories; long sentences (≥8 words) auto-chunk into ~4-word pieces, taught incrementally before the full sentence.
Completion with word credit: When a podcast is marked complete, times_heard is incremented on UserLemmaKnowledge for all content words. Lemma IDs are pre-computed at generation time and stored in podcast metadata JSON — no additional DB lookups at completion time.
The Frontend
Screen Inventory
Reading (index.tsx) — main sentence review, the core daily loop
Learn (learn.tsx) — new word introduction with rich info cards
Stats (stats.tsx) — dashboard: vocabulary funnel, textbook benchmarks, Quran progress, retention
Explore (explore.tsx) — browse words/roots/patterns with smart filters
Stories (stories.tsx) — generate, import, or read stories
Scanner (scanner.tsx) — OCR book import
Word Detail (word/[id].tsx) — full word profile: forms, root family, etymology, review history
Root Detail (root/[id].tsx) — root family view with pattern groupings
Pattern Detail (pattern/[id].tsx) — morphological pattern with examples
Story Reader (story/[id].tsx) — tap-to-lookup Arabic reader with audio playback
Podcasts (podcasts.tsx) — generated listening practice with format sampler
Listening (listening.tsx) — dedicated listening practice mode
Review Lab (review-lab.tsx) — experimental review features
Words (words.tsx) — full word list browser
Book Import (book-import.tsx) — import books with word-level tracking
Book Page (book-page.tsx) — per-page reading view for imported books
Chats (chats.tsx) — AI chat conversations
More (more.tsx) — activity log, topic selection, tashkeel settings
The Review Flow
A typical reading session:
**Session load**: API returns 10-20 sentences with interleaved intro candidates
**Pre-session modals** (if any): Grammar refreshers, reintroduction cards, experiment cards
**Main loop** — for each sentence:
Arabic text displayed (38px Scheherazade New, full diacritics)
User taps words to see morphological info (root family, pattern, forms, etymology)
User reveals English translation
User marks each word: unmarked (understood), yellow (confused), red (missed)
**Wrap-up quiz** (if ≥2 cards): Word cards from the session — flip to reveal meaning, rate "Got it" or "Missed"
**Session end**: Results summary, per-word outcomes, new knowledge state transitions
Three-Tap Marking
First tap: RED (missed) + word info card appears
Second tap: RED → YELLOW (confused) + confusion analysis loads
Third tap: YELLOW → OFF (clears marking)
Only marked words (red/yellow) are submitted as failed. Everything else counts as understood.
Word Info Card (On Tap)
When you tap a word during review:
Header — English gloss, Arabic lemma, transliteration, POS badge
Forms strip — up to 3 relevant forms (plural, feminine, masdar) with transliteration
Chips — frequency rank, root (tappable → root detail), pattern (tappable → pattern detail)
Root family siblings — known/learning words sharing the same root
Derivation bridge (for Forms II-X verbs) — shows the verb form relationship, e.g., "causative/intensive of عَلِمَ — to know" for عَلَّمَ (Form II). Maps all ten derived verb forms to their semantic relationship with Form I
Confusion analysis (only when marked yellow) — similar-looking words with character-level diff highlighting
Design System
Dark theme with semantic color coding:
Background: #0f0f1a (deep navy)
Arabic text: #f0f0ff (bright lavender — maximum legibility against dark)
Green (#2ecc71): Known / understood
Orange (#e67e22): Learning / in progress
Blue (#3498db): Acquiring
Red (#e74c3c): Missed
Yellow (#f39c12): Confused
CEFR spectrum: A1 green → A2 dark green → B1 blue → B2 orange → C1 red → C2 purple
Dual Arabic fonts (50/50 mixing): Review cards alternate between Scheherazade New (SIL, learner-optimized, conservative ligatures) and Amiri (Bulaq press tradition, aggressive ligatures matching printed books). Deterministic split by sentence_id % 2. Builds familiarity with both learner-friendly and print-style typography — important because real Arabic books use fonts with much more aggressive ligatures than learner materials.
Graduated tashkeel fading: Diacritical marks are hidden based on word stability, but with different thresholds for target and scaffold words. Target/due words (the ones being tested) keep tashkeel until FSRS stability ≥ 90 days. Scaffold words (mature context words) fade at stability ≥ 30 days — a third of the threshold, because removing retrieval cues is most beneficial when the learner is likely to succeed (Bjork desirable difficulties). Full tashkeel is always restored on the card back (verification). A 3-state per-card toggle (dot cycles: default → all vowels → no vowels) lets me override for individual cards. This transitions toward reading Arabic as it's actually written — without vowel marks.
Arabic font sizing: 38px for sentences (primary reading), 36px for word focus cards, 24px for secondary headers, 20px for lists.
RTL handling: React Native's native writingDirection: "rtl". Grapheme cluster handling for correct text highlighting with diacritics.
Offline-First
Full sync queue for all mutable actions: sentence reviews, story actions, word introductions, reintroduction results, experiment intro acknowledgments, grammar introductions
Auto-prefetch: 2 sessions cached in background after every session load; deep prefetch (up to 20) via More tab button
Background session refresh on 15-minute gap detection
12-second fetch timeout with stale-cache fallback
Word lookups cached in AsyncStorage with 24h TTL and stale fallback when offline
30-minute session staleness TTL, stale sessions allowed when offline
Data, Experimentation & Tuning
The A/B Testing Framework
Alif has a built-in experiment framework. Words are randomly assigned to experiment groups, and outcomes are tracked separately.
Card Introduction A/B Test (the first experiment — concluded after 5 weeks, 264 words):
Group A (122 words): sentence-only introduction (encounter words in context)
Group B (142 words): rich info card first (etymology, forms, memory hooks), then sentences
Final results: card-first produced +28pp first-review accuracy (65% vs 37%), 2.4x faster graduation (11h vs 26h median), 73% vs 57% graduation rate, with similar post-graduation FSRS performance (95% vs 96%)
Decision: all new words now get info cards. The leech rate difference seen in early data narrowed as the experiment matured. Rescue cards were added for stuck words (acquiring, ≥4 reviews, <50% accuracy) with a 7-day cooldown.
Algorithm Tuning From Data
The warmup effect: First review of the day has 78.1% accuracy vs 94.8% for later reviews — a 16.7 percentage point gap. This led to changing the algorithm to prioritize high-stability (easy) words for the first session of each day.
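With every review in SQLite, a finding like this is one window-function query away. A sketch, assuming review_log carries a reviewed_at timestamp and that rating ≥ 3 counts as correct — both assumptions, not confirmed schema details:

```python
import sqlite3

# Compare accuracy of the first review of each day against all later
# reviews that day. Assumes a `reviewed_at` timestamp on review_log
# and treats rating >= 3 as correct (illustrative assumptions).
WARMUP_SQL = """
WITH ranked AS (
    SELECT rating,
           ROW_NUMBER() OVER (
               PARTITION BY date(reviewed_at)
               ORDER BY reviewed_at
           ) AS n_of_day
    FROM review_log
)
SELECT CASE WHEN n_of_day = 1 THEN 'first' ELSE 'later' END AS slot,
       ROUND(100.0 * AVG(rating >= 3), 1) AS accuracy_pct
FROM ranked
GROUP BY slot
"""

def warmup_gap(db_path: str) -> dict:
    with sqlite3.connect(db_path) as conn:
        return dict(conn.execute(WARMUP_SQL).fetchall())
```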
The 0% leech recovery bug: Analysis of 36 words with leech history showed none had ever recovered. Root cause: cumulative accuracy tracking made recovery mathematically near-impossible. A word seen 10 times with 30% accuracy would need 4 consecutive correct reviews just to reach 50%. Fix: partial stat reset on reintroduction (times_seen = max(3, times_seen // 2)).
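The arithmetic of the trap, and the fix, fit in a few lines. The times_seen halving is the quoted fix; the times_correct handling here is my illustrative guess, not confirmed from the source:

```python
def reviews_to_reach(seen: int, correct: int, target: float) -> int:
    """Consecutive correct reviews needed until cumulative accuracy >= target."""
    k = 0
    while (correct + k) / (seen + k) < target:
        k += 1
    return k

def reset_on_reintroduction(times_seen: int, times_correct: int) -> tuple[int, int]:
    """Partial stat reset applied when a leech is reintroduced.

    Halving times_seen (floored at 3) keeps some history while making
    cumulative accuracy recoverable again. The times_correct scaling is
    illustrative.
    """
    new_seen = max(3, times_seen // 2)
    new_correct = min(new_seen, times_correct // 2)
    return new_seen, new_correct
```

For the example from the analysis — 10 reviews at 30% accuracy — the pre-fix word needs 4 straight correct answers just to touch 50%; after the reset the hole is shallower.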
Pipeline bottleneck analysis: Found that 80.9% of all generated sentences were retired without ever being shown. 49% of the sentence pool served tier-4 words (due in 72h+) that didn't need sentences yet. Fix: tier-based sentence lifecycle — sentences for far-future words are immediately retired.
Introduction rate tuning: Accuracy-driven ramp: the system monitors recent accuracy and adjusts how many new words to introduce. At 89.8% overall accuracy (above the 85% "desirable difficulty" zone), there was room to increase from 4 to 7-10 new words per session.
Sentence diversity problem: Analysis found هل (question particle) started 24.2% of all sentences, and 87 lemmas appeared in 20+ sentences creating scaffold overexposure. Led to diversity scoring in the session assembly algorithm.
The encountered word dead zone: 387 words imported from Duolingo, textbook scans, and story imports were stuck in an "encountered" state — the system acknowledged their existence but gave them zero review credit when they appeared in sentences. A word like قَرَأَ ("to read") had been seen 100+ times in stories and reviewed sentences but was invisible to the review engine. Paradoxically, a completely unknown word (no record at all) was auto-introduced on collateral appearance. Being "encountered" was worse than being unknown. Fix: every word in every sentence earns review credit, no exceptions. Encountered words appearing in reviewed sentences get auto-introduced and immediately reviewed. Tier 0 instant graduation handles the ones already familiar — first correct review goes straight to FSRS.
Retention % replacing "due" counts: The stats and session-end screens originally showed "42 reviewed / 342 remaining" — but the "remaining" count grows with vocabulary (always 100-300+ for a 1200-word learner), is never clearable, and was purely anxiety-inducing without being actionable. Replaced with a single metric: 7-day retention percentage, color-coded (green ≥90%, amber 85-89%, red <85%). Retention % tells you whether to do more sessions (dropping → more needed) or relax (holding → stable). Consistent with Alif's core philosophy: the system manages the schedule, the learner's only job is to show up.
Confused words → Rating.Hard: Confused (yellow-marked) words were getting Rating 3 (Good) — the same FSRS treatment as perfectly recognized words. With stability over 60 days, the next review could be weeks away despite the user demonstrating they didn't recognize the word. Analysis of 136 confusion events showed all occurred on words with 60+ day stability, with an 84% recovery rate on the next review. Changed to Rating 2 (Hard), which reduces interval growth (~1.2x vs ~2.5x) without lapsing the card. Semantically appropriate: "I got it, but it was hard."
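The resulting mapping is small enough to show in full — a sketch using the standard FSRS 1-4 rating scale (the function name is illustrative):

```python
from enum import IntEnum

class Rating(IntEnum):
    """The standard FSRS rating scale."""
    Again = 1
    Hard = 2
    Good = 3
    Easy = 4

def rating_for(missed: bool, confused: bool) -> Rating:
    """Map the review UI's markings onto FSRS ratings.

    Red (missed) lapses the card; yellow (confused) now maps to Hard,
    which slows interval growth (~1.2x vs ~2.5x) without a lapse;
    unmarked words rate Good.
    """
    if missed:
        return Rating.Again
    if confused:
        return Rating.Hard
    return Rating.Good
```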
Textbook Benchmarks (Replacing CEFR)
The stats screen originally showed CEFR level (A1→C2) estimated from vocabulary size. But CEFR is a framework for general communicative competence — pinning it to vocabulary count felt misleading and unactionable. Replaced with textbook frequency benchmarks: the system measures what percentage of words from corpus-derived frequency tiers the learner knows. "You know 78% of the 500 most common words in Arabic" is concrete and actionable; "You're A2" is vague.
The stats screen now also shows Quran progress — verse mastery across surahs, with per-surah breakdown — and pages read (all-time and this week) from imported books and OCR content.
Lemma Quality Gate
A centralized quality gate ensures every lemma entering the system — whether from sentence generation, Quran verse lemmatization, OCR import, or manual creation — meets minimum data requirements (gloss, root, transliteration). An LLM safety net guarantees 100% gloss coverage for Quran words. This was added after discovering glossless words reaching the frontend review cards, showing Arabic text with no English meaning available.
Analytics Capabilities
Because all data is in a SQLite database, I can query anything:
Per-word acquisition time (median 5 days, 7 reviews)
Accuracy by time of day, session position, word source
Leech rate by introduction method
Root family coverage and completion
Vocabulary coverage of any target text
Textbook frequency tier coverage
I regularly run analysis sessions with Claude, asking it to query the database and identify patterns. This has led to most of the algorithm improvements described above.
Simulation Framework
Before deploying algorithm changes to production (where the only user is me, and bad changes waste real learning time), I can simulate multi-day learning journeys:
python3 scripts/simulate_sessions.py --days 30 --profile beginner
Five learner profiles: beginner (55% accuracy), casual (70%), intensive (75%), calibrated (80%, derived from my actual production data), and strong (85%). The simulator runs the full session assembly, FSRS scheduling, and acquisition pipeline, producing charts of vocabulary growth, graduation rates, and leech accumulation. This caught several bugs that would have taken days to surface in real usage — like a graduation gate that blocked collateral reviews from triggering advancement.
Development Story
Timeline
Feb 8 — MVP: sentence architecture, Docker deployment, review UI (66 commits)
Feb 9 — Story reader, TTS, offline sync
Feb 10-11 — Learn mode, CAMeL Tools integration, variant detection (52 commits)
Feb 12 — OCR scanner, quality gates, root extraction, 48 grammar features (59 commits)
Feb 13-15 — Book import, CEFR predictions, stats dashboard, comprehensibility gate
Feb 16-18 — Three-phase word lifecycle, FSRS refinements, story generation benchmark (39 commits)
Feb 19-22 — Memory hooks (overgenerate-and-rank), session performance fix (18s→1.2s), Explore tab
Feb 23-27 — A/B testing framework, confusion analysis, tiered graduation, offline reading
Mar 1-4 — Voice cloning, UI redesign sprint, batch OCR, phonetic similarity analysis
Mar 7-12 — Verb conjugation recognition, root/pattern info across all cards, lemmatization feedback loop
Mar 14-17 — Unified sentence pipeline, mapping correction (verify-then-correct), intro card aggression tuning, leech sliding window
Mar 18-21 — Collateral credit fix, tashkeel fading + dual fonts, A/B experiment conclusion, comprehensive pipeline audit
Mar 22-23 — Podcast system, story audio + archive + format diversity, homograph correction, multi-hop variant chains
Mar 24-26 — DB write lock discipline, derivation bridge, pre-reading clusters, graduated tashkeel fading, podcast auto-generation
Mar 27-30 — Quran reading mode with Uthmani lemmatizer, verse SRS, intro card interleaving, Quran word interaction
Mar 31 — Stats overhaul (textbook benchmarks replacing CEFR), lemma quality gate, frequency backfill, Quran refinements
Total: 610 commits across 80+ Claude Code conversations. About 750 user messages — mostly short directives (average 20 words for genuine prompts), with the longest single conversation running to 110+ messages. Many conversations were analysis sessions rather than pure coding — querying the learning database, identifying patterns, and proposing algorithm changes in the same session that implemented them.
Day 1: From Zero to Studying on My Phone
The first message to Claude Code was sent at 8:47 AM on February 8: "I am going to make an app to learn Arabic. I first need you to do extensive research using subagents..." Five research agents launched simultaneously — Arabic morphology tools, NLP APIs, datasets, diacritization, learning architecture — and all five completed within 8 minutes.
A key early constraint: the architecture had to be designed so that Claude Code could independently test both the algorithm (via pytest/curl) and the UI (in a browser), without the two being coupled. This led directly to the FastAPI + Expo split.
By 9:31 AM, 302 lexemes had been extracted from my Duolingo Arabic progress as seed vocabulary. By mid-morning, agent swarms were building backend, frontend, data storage, and the learning algorithm in parallel — with reconnaissance agents studying three of my existing projects (Comenius, NinjaOrd, Petrarca) to extract reusable patterns.
At 3:39 PM came the most important correction of the entire project: I noticed the app was building isolated word flashcards. "Right now we're doing plain words — I explicitly said sentences." The pivot to sentence-centric review happened immediately, but at that point zero sentences existed in the database.
By 6:30 PM the backend was running in Docker and the app was on my phone. By evening, I was actually studying with it. The first session revealed the FSRS stability concept through real data: "stability of 0.15 means FSRS predicts you'll forget this word in ~3.6 hours."
Development Intensity
80+ Claude Code conversations, 610 commits. The pace was intense in the first two weeks, then shifted: fewer conversations per day, but each one more targeted — an analysis session leading to an algorithm change, a specific bug traced through the database, a new feature designed and shipped in a single conversation. Many conversations weren't about writing code at all — they were about understanding the data. "My accuracy seems lower in the morning, can you check?" → SQL queries → discovery of the warmup effect → algorithm change → deploy, all in one session.
The Developer-as-User Feedback Loop
The most unusual thing about Alif's development is that the developer was the only user. Every evening's study session generated data that drove the next day's algorithm changes. Fast-track graduation, leech recovery, comprehensibility gates, function word handling — all emerged from real usage patterns, not hypothetical design. When I got confused by lil-kitab (a preposition fused with the definite article), that confusion spawned a grammar detection system. When function words kept showing up as zombie review cards every 2.3 days, that led to their exclusion from the FSRS pipeline in favor of collateral credit.
The RTL Touch Target Crisis (Day 1)
One of the hardest UI bugs: tapping Arabic words in stories didn't work. Wrapped RTL (right-to-left) text in React Native has fundamentally broken touch targets. Four consecutive approaches failed: row-reverse, direction: rtl, scaleX: -1 flip, individual Pressable per word. I eventually said "This does not work either. Deeply rethink your approach." The solution: a single Pressable over the entire text container, with onLayout per word recording its position, and coordinate-based hit detection from the touch event. No individual touch targets at all.
The Session Build Performance Fix
Session assembly was taking 18 seconds (blocking the UI on every session start). Profiling revealed the bottleneck was the lemma lookup — doing individual CAMeL Tools calls for every word in every candidate sentence. Fix: pre-build a fast dict-only lookup (build_lemma_lookup()) mapping bare forms to lemma IDs. Result: 18 seconds → 1.2 seconds.
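A sketch of the dict-only lookup — the row shape and the strip_diacritics normalizer are illustrative, and the real build_lemma_lookup() may differ:

```python
def strip_diacritics(text: str) -> str:
    # Remove Arabic harakat (U+064B..U+0652) to get the bare form.
    return "".join(ch for ch in text if not "\u064b" <= ch <= "\u0652")

def build_lemma_lookup(lemma_rows) -> dict:
    """Pre-build a bare-form -> lemma_id dict once per session build,
    so assembly never calls the morphological analyzer per word.

    `lemma_rows` stands in for (lemma_id, surface_forms) pairs whose
    forms were computed once at generation time.
    """
    lookup = {}
    for lemma_id, forms in lemma_rows:
        for form in forms:
            lookup.setdefault(strip_diacritics(form), lemma_id)
    return lookup
```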
Later, a more insidious performance bug appeared: a synchronous LLM verification gate added to the session builder caused 30-60 second timeouts. When Gemini timed out overnight, it locked the database for minutes — compounding with a CLI JSON parse failure that prevented the fallback chain from triggering, and a stats reload storm that fired 20 extra requests per session. Three bugs, each tolerable alone, that cascaded into complete session failure. The fix established a hard rule: no LLM calls in the session build critical path. All verification happens at generation time or in background tasks.
The 82% Sentence Rejection Crisis
At one point, 82% of generated sentences were being rejected by the comprehensibility validator. Investigation revealed the validator wasn't recognizing conjugated verb forms as "known words" — يَكْتُبُ (yaktubu, "he writes") wasn't matching the lemma كَتَبَ (kataba, "he wrote"). Fix: generate ~33 conjugation forms per verb and check surface forms against all known forms. Rejection rate dropped dramatically.
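A minimal sketch of the post-fix check, assuming each known lemma carries its set of pre-generated conjugation forms (the data shape is illustrative, and normalization is elided):

```python
def is_known_surface(surface: str, known_lemmas: dict) -> bool:
    """Comprehensibility check after the fix: a surface form counts as
    known if it matches ANY stored conjugation of a known lemma, not
    just the citation form.

    `known_lemmas` maps lemma -> set of ~33 pre-generated conjugation
    forms (illustrative structure).
    """
    return any(surface == lemma or surface in forms
               for lemma, forms in known_lemmas.items())
```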
Voice Cloning (March 1)
My favorite Arabic TikToker (RootsOfKnowledge) makes extremely clear MSA content with precise pronunciation. Downloaded 157 minutes of audio, created a Professional Voice Clone on ElevenLabs, and used it for story narration at 0.8x speed. This is the kind of deeply personal feature that a commercial app could never offer — but for a single-user system, it's trivial.
Research Foundations
Key principles from the research literature that shaped Alif's design:
Vocabulary Acquisition
8-12 meaningful encounters needed for stable vocabulary (Uchihara et al. 2019 meta-analysis)
First 2 encounters determine learning to the largest extent
<6 encounters = <30% recall after 1 week; 10+ encounters = 80%+ recall
Context diversity, retrieval effort, and modality matter — not just repetition count
Spacing & Acquisition
Optimal initial schedule: 3-4 exposures on day 0, critical review on day 1, then day 3-4, then day 7-10, then FSRS takes over
Expanding spacing (lag 0-1-5-9) better for initial acquisition
FSRS has no native learning phase — the Leitner acquisition system fills this gap
Interference & Confusable Words
~80% of Arabic letters share base form (rasm), differentiated only by dots
Semantically similar items presented together cause 1.6x more interference errors
For visually similar Arabic words: interleaving (not blocking) forces discrimination
Arabic Morphology
L2 learners demonstrably organize Arabic lexicons by root (psychologically real)
Combining reading + morphological awareness training = best vocabulary gains
Root priming is measurable — knowing one family member helps learn others
Diacritization
Diacritized text consistently helps L2 learners at all levels
No evidence of harmful dependency (learners don't become "addicted" to diacritics)
Graduated fading is desirable difficulty (Bjork): removing cues when learner is likely to succeed improves long-term retention
Pretesting & Failed Retrieval
Failed retrieval attempts improve subsequent learning by 19pp even with 95% miss rate (Richland, Kornell & Kao 2009, d=1.1)
Root-family knowledge provides ~50-70% semantic access to unknown words (Boudelaa & Marslen-Wilson 2013)
Pre-reading exposure to unknown words (even failed attempts) primes the encoding process
Key Constants Reference
Acquisition (Leitner)
Box 1 interval: 4 hours (within-session advancement allowed)
Box 2 interval: 1 day (sleep consolidation)
Box 3 interval: 3 days (long-term consolidation)
Graduation min reviews: 5 (standard path, Tier 3)
Graduation min accuracy: 60% (standard path)
Graduation min calendar days: 2 (standard path)
Root sibling threshold: 2 (for stability boost on graduation)
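Read as code, the acquisition constants are a tiny interval table plus a graduation gate — a sketch of the standard path only, not the production scheduler:

```python
from datetime import timedelta

# Leitner box -> review interval (acquisition phase, before FSRS).
BOX_INTERVALS = {
    1: timedelta(hours=4),  # within-session advancement allowed
    2: timedelta(days=1),   # sleep consolidation
    3: timedelta(days=3),   # long-term consolidation
}

def can_graduate(reviews: int, accuracy: float, calendar_days: int) -> bool:
    """Standard-path (Tier 3) graduation gate from the constants above."""
    return reviews >= 5 and accuracy >= 0.60 and calendar_days >= 2
```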
Leech Management
Min reviews for leech: 5
Max accuracy for leech: 50%
Cooldown (1st): 3 days
Cooldown (2nd): 7 days
Cooldown (3rd+): 14 days
Session Assembly
Max auto-intro per session: 5
Auto-intro accuracy floor: 70%
Pipeline backlog threshold: 40 (dynamic: ≥90% acc → 80, ≥80% → 60)
Max unknown scaffold per sentence: 2
Comprehensibility gate: 60% known scaffold
Due coverage exponent: 1.5
Source bonus (books): 1.3x
Session diversity decay: 0.5 per reuse
Recency (understood): 1 day
Recency (partial): 4 hours
Recency (no idea): 30 minutes
Sentence Generation
Max retries: 7
Known vocabulary sample: 500 words
Max avoid words: 30
Pipeline Tiers
Tier 1 (≤12h) — 3 sentences, floor 2
Tier 2 (12-36h) — 2 sentences, floor 1
Tier 3 (36-72h) — 1 sentence, floor 0
Tier 4 (72h+) — 0, retired
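The tier table reduces to one function — a sketch assuming hours_until_due is derived from the word's FSRS due date (illustrative name, not the actual code):

```python
def sentence_quota(hours_until_due: float) -> int:
    """Tier-based sentence lifecycle: how many live sentences a word
    should keep, given when it is next due. Words due in 72h+ get
    none; their sentences are retired immediately.
    """
    if hours_until_due <= 12:
        return 3  # Tier 1
    if hours_until_due <= 36:
        return 2  # Tier 2
    if hours_until_due <= 72:
        return 1  # Tier 3
    return 0      # Tier 4: retired
```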
Database Schema (Summary)
The full schema has 23 tables. Key ones:
roots — trilateral root registry (root, core_meaning_en, enrichment_json)
lemmas — word dictionary (lemma_ar, root_id, pos, gloss_en, frequency_rank, forms_json, etymology_json, memory_hooks_json, wazn)
user_lemma_knowledge — per-word learning state (knowledge_state, fsrs_card_json, acquisition_box, times_seen, times_correct, leech_count)
sentences — sentence corpus (arabic_diacritized, english_translation, target_lemma_id, times_shown, difficulty_score)
sentence_words — token-level mapping (sentence_id, position, surface_form, lemma_id)
review_log — complete review history (lemma_id, rating, session_id, comprehension_signal, credit_type, was_confused)
stories — stories and imported texts (body_ar, body_en, status, readiness_pct, total_words, known_count)
story_words — word-level mapping in stories (story_id, position, lemma_id, is_known_at_creation)
grammar_features — 49 features across 8 tiers (category, feature_key, form_change_type)
user_grammar_exposure — grammar learning tracking (feature_id, times_seen, comfort_score)
content_flags — user-reported corrections (content_type, original_value, corrected_value, status)
pattern_info — morphological pattern metadata, 46 patterns (wazn, wazn_meaning, enrichment_json)
quranic_verses — full Quran, 6,236 verses with SRS state (surah, ayah, arabic_text, srs_level, next_due)
quranic_verse_words — token-level Quran word mappings (verse_id, position, surface_form, lemma_id)
pipeline_snapshots — daily pipeline state tracking (snapshot_date, metrics_json)
This article documents Alif as of late March 2026, after 7.5 weeks of development and daily use. 610 commits, 2,060 tests, 80+ Claude Code conversations. The system continues to evolve.
