JariLingo • 2026
How I Built a Real-Time AI to Break the Sign Language Barrier

ROLE
AI Mobile Engineer & Tech Lead
TIMELINE
May - Jun 2026
TEAM
Solo Developer
SKILLS
SwiftUI & SwiftData
CoreML
Python & TensorFlow
Overview
A Deep-Dive Technical Case Study on Engineering a Real-Time Indonesian Sign Language (BISINDO) Recognition System.
There are over 2.6 million Deaf individuals in Indonesia whose primary language is Indonesian Sign Language (BISINDO). Unfortunately, the hearing majority lacks accessible ways to learn it. Popular apps like Duolingo have successfully democratized spoken languages, but sign languages lag behind because they require real-time spatial and visual analysis, not just text or audio matching.
My mission was to democratize BISINDO learning by designing an interactive, gamified, and inclusive iOS application. The hard requirements: the AI system must run entirely on-device for privacy, process the camera feed at 30 FPS without perceptible lag, and feel as natural as interacting with a real human instructor.




Phase 1
The Video-Based ML Trap & Strategic Pivot
My initial hypothesis was to train Apple's CreateML Action Classifier using dummy video samples I recorded myself.
Testing revealed abysmal accuracy (<40%) whenever there were changes in lighting, background, or user clothing. The Action Classifier analyzes bulk pixels as a whole, rather than anatomical structures.
This failure provided a crucial insight: the AI must be blind to colors and pixels and evaluate pure mathematical skeletons (landmarks) instead. I also realized that I needed large-scale, peer-reviewed academic data rather than self-recorded samples. Consequently, I integrated the WL-BISINDO Dataset (Kindy et al., 2025), a research repository containing 1,600 high-quality RGB videos across 32 isolated BISINDO signs.

Phase 2
Spatial Extraction & Kinematic Feature Engineering
Feeding raw videos directly into an AI model is bad practice. I engineered a Python data pipeline (`extract_landmarks.py`) powered by Google MediaPipe (`hand_landmarker.task` & `pose_landmarker_lite.task`).

    # extract_landmarks.py - The Pivot to Skeleton-Based ML
    import mediapipe as mp

    # 1. Initialize Base Options for Hand & Pose Models
    BaseOptions = mp.tasks.BaseOptions
    VisionRunningMode = mp.tasks.vision.RunningMode

    # 2. Configure Hand Landmarker for Video Processing
    hand_options = mp.tasks.vision.HandLandmarkerOptions(
        base_options=BaseOptions(model_asset_path='hand_landmarker.task'),
        running_mode=VisionRunningMode.VIDEO,
        num_hands=1,
        min_hand_detection_confidence=0.5
    )

    # 3. Create instances to extract mathematical skeletons, ignoring pixels
    with mp.tasks.vision.HandLandmarker.create_from_options(hand_options) as landmarker:
        # Process 1,600 WL-BISINDO videos frame-by-frame...
        # (Extracting 21 X,Y,Z coordinates per frame)
        pass

extract_landmarks.py
I extracted 21 hand joints (63 spatial X,Y,Z coordinates) per frame, transforming 1,600 videos into a foundational CSV dataset of nearly 100 MB.
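To make the 21-joint to 63-value mapping concrete, here is a minimal sketch of how one frame's landmarks could be flattened into a CSV row. The function names and the zero-row fallback for frames without a detected hand are illustrative assumptions, not the actual contents of `extract_landmarks.py`.

```python
from typing import List, Sequence, Tuple

Landmark = Tuple[float, float, float]
NUM_JOINTS = 21  # hand joints per frame, as stated in the text


def flatten_hand(landmarks: Sequence[Landmark]) -> List[float]:
    """Flatten 21 (x, y, z) joints into one 63-value row."""
    if len(landmarks) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joints, got {len(landmarks)}")
    row: List[float] = []
    for x, y, z in landmarks:
        row.extend((x, y, z))
    return row


def empty_row() -> List[float]:
    """Hypothetical fallback: a zero row for frames where no hand was detected."""
    return [0.0] * (NUM_JOINTS * 3)


# Synthetic landmarks standing in for one MediaPipe detection result.
fake_hand = [(i * 0.01, i * 0.02, i * 0.03) for i in range(NUM_JOINTS)]
row = flatten_hand(fake_hand)
print(len(row))
```

Keeping the joint order fixed (x, y, z for joint 0, then joint 1, and so on) matters because every later step treats column positions as stable features.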

Sign language is not merely static poses; it is fundamentally about rhythm and velocity. In the `process_kinematics.py` script, I executed advanced Feature Engineering techniques:
Kalman Filter (63-dimensional)
Cleaned up visual jitter from the raw camera feed.
Derived Kinematics
Calculated Velocity (63 features) and Acceleration (63 features) between consecutive frames.
Temporal Windowing
Standardized the data sequences into a fixed time window of exactly 30 frames.

    # process_kinematics.py - The Core Logic

    # 1. Denoising raw coordinates using 63-dimensional Kalman Filter
    kf = KalmanFilter(initial_state_mean=coords[0], n_dim_obs=63)
    smoothed_coords, _ = kf.smooth(coords)

    # 2. Deriving Kinematics (Velocity & Acceleration)
    velocity = np.zeros_like(smoothed_coords)
    velocity[1:] = smoothed_coords[1:] - smoothed_coords[:-1]
    acceleration = np.zeros_like(velocity)
    acceleration[1:] = velocity[1:] - velocity[:-1]

    # 3. Concatenating into a 189-dimensional Feature Tensor
    full_features = np.concatenate([smoothed_coords, velocity, acceleration], axis=1)

process_kinematics.py
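This process converted the raw video into tensors of shape `(N, 30, 189)` (`X_data_fase_2.npy`), where N is the number of windows, 30 is the frames per window, and 189 is the features per frame (63 smoothed coordinates + 63 velocities + 63 accelerations). Because the model now sees only hand geometry and motion, lighting, background, and clothing no longer contribute noise.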
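The text does not say how clips of different lengths are brought to exactly 30 frames. One common option, shown here purely as an assumption, is to pick 30 evenly spaced frames (nearest-index resampling), which shortens long clips and repeats frames in short ones. The function name is hypothetical.

```python
from typing import List

WINDOW = 30  # fixed window length from the text


def resample_to_window(frames: List[List[float]], window: int = WINDOW) -> List[List[float]]:
    """Pick `window` evenly spaced frames (nearest-index resampling)."""
    if not frames:
        raise ValueError("sequence is empty")
    if window < 2:
        raise ValueError("window must be at least 2")
    n = len(frames)
    if n == 1:
        return [list(frames[0]) for _ in range(window)]
    picked = []
    for i in range(window):
        # Map output index i in [0, window - 1] onto a source index in [0, n - 1].
        src = round(i * (n - 1) / (window - 1))
        picked.append(list(frames[src]))
    return picked


# Synthetic clips: each frame holds 189 copies of its own frame index.
long_clip = [[float(t)] * 189 for t in range(75)]
short_clip = [[float(t)] * 189 for t in range(12)]
a = resample_to_window(long_clip)
b = resample_to_window(short_clip)
print(len(a), len(a[0]), len(b), len(b[0]))
```

Stacking N such windows gives the `(N, 30, 189)` layout. Zero-padding or sliding windows are equally plausible alternatives; which one the pipeline actually uses is not stated.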
Phase 3
Training the "Brain" (Transitioning from LSTM to Transformer)
Recurrent networks such as LSTMs (Long Short-Term Memory) struggled to capture the long-range temporal dependencies of complex BISINDO signs, where the beginning and end of a gesture together determine its meaning.
I custom-designed and trained a Spatial-Temporal Transformer architecture from scratch using TensorFlow/Keras (`fase_3_train.py`).
I implemented Sinusoidal Positional Encoding (following the "Attention Is All You Need" paper) to inject temporal awareness into the model.
I stacked 4 Multi-Head Attention blocks (`NUM_HEADS=4`, `D_MODEL=128`), ending with `GlobalAveragePooling1D`.

    # fase_3_train.py - Custom Spatial-Temporal Transformer
    def build_bisindo_transformer(window_size=30, num_features=189, d_model=128):
        inputs = keras.Input(shape=(window_size, num_features))

        # 1. Feature Projection & Temporal Awareness
        x = layers.Dense(d_model)(inputs)
        x = x + positional_encoding(window_size, d_model)

        # 2. Stacking 4 Multi-Head Attention Blocks
        for _ in range(4):
            x = TransformerBlock(d_model, num_heads=4, ff_dim=256)(x)

        x = layers.GlobalAveragePooling1D()(x)
        outputs = layers.Dense(32, activation="softmax")(x)
        return keras.Model(inputs, outputs)

fase_3_train.py
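The snippet above calls a `positional_encoding` helper without showing it. As an illustration only, here is the sinusoidal scheme from "Attention Is All You Need" written in plain Python: even columns hold sin(pos / 10000^(2i/d_model)) and odd columns hold the matching cosine. The real helper presumably returns a TensorFlow tensor; this list-based version is an assumption made to keep the sketch dependency-free.

```python
import math
from typing import List


def positional_encoding(length: int, d_model: int) -> List[List[float]]:
    """Return a length x d_model table of sinusoidal position codes."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(length):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # i is the even column index, so i / d_model equals 2k / d_model.
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        table.append(row)
    return table


pe = positional_encoding(30, 128)
print(len(pe), len(pe[0]))
```

Because each of the 30 frames gets a distinct, fixed code, attention layers that would otherwise ignore order can tell the start of a gesture from its end.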
The Result
With a Cosine Decay Restarts learning-rate schedule, this Transformer reached a validation accuracy above 90% across the 32 vocabulary classes by epoch 100, a clear improvement over the LSTM approach on this task.
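For readers unfamiliar with the schedule, the following is a small sketch of how a cosine decay with warm restarts computes the learning rate at a given step. The parameter names mirror common usage, and every numeric value below is a placeholder, not the configuration used in `fase_3_train.py`.

```python
import math


def cosine_decay_restarts(step: int, initial_lr: float, first_decay_steps: int,
                          t_mul: float = 2.0, m_mul: float = 1.0,
                          alpha: float = 0.0) -> float:
    """Learning rate at `step` for a cosine schedule with warm restarts."""
    cycle_len = float(first_decay_steps)
    cycle_start = 0.0
    peak = initial_lr
    # Walk forward through completed cycles; each restart stretches the
    # cycle by t_mul and scales its peak by m_mul.
    while step >= cycle_start + cycle_len:
        cycle_start += cycle_len
        cycle_len *= t_mul
        peak *= m_mul
    progress = (step - cycle_start) / cycle_len
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * ((1.0 - alpha) * cosine + alpha)


# Placeholder values for illustration only.
lrs = [cosine_decay_restarts(s, initial_lr=1e-3, first_decay_steps=10) for s in range(31)]
print(lrs[0], lrs[5], lrs[10])
```

The rate falls smoothly toward zero within a cycle, then jumps back to the peak at each restart, which can help the optimizer leave a poor minimum.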

Phase 4
Hacking CoreML Limitations
Core ML's ML Program models, the format I targeted for the Neural Engine, ship as `.mlpackage`. However, converting the Keras Transformer model with the `coremltools` library (under TensorFlow 2.17) resulted in fatal crashes due to structural incompatibilities with the `SavedModel` format (the `_DictWrapper` bug).
Instead of giving up, I performed an engineering hack (`fase_4_convert.py`). Rather than saving the model to a standard file format, I wrapped the Transformer layers inside a `@tf.function` (Concrete Function), bypassed the file system serialization entirely, and injected the execution graph directly into the converter's memory.
Furthermore, I kept the compute precision at FLOAT32 (the concern being that Float16 rounding would distort the Attention weights) and stored the normalization statistics (`norm_mean`, `norm_std`) directly in the `.mlpackage` metadata.
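The names `norm_mean` and `norm_std` suggest per-feature standardization, although the text does not spell out the formula. Assuming a plain z-score, the statistics would be computed once on the training features and then applied to every incoming frame at inference time, roughly like this. The helper names and the `eps` guard are illustrative.

```python
from typing import List, Tuple


def fit_stats(rows: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature mean and (population) standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dims)]
    std = [(sum((r[j] - mean[j]) ** 2 for r in rows) / n) ** 0.5 for j in range(dims)]
    return mean, std


def normalize(row: List[float], norm_mean: List[float], norm_std: List[float],
              eps: float = 1e-8) -> List[float]:
    """Apply z-score normalization; eps guards against constant features."""
    return [(x - m) / (s + eps) for x, m, s in zip(row, norm_mean, norm_std)]


# Tiny synthetic feature table: the second feature is constant.
train = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
norm_mean, norm_std = fit_stats(train)
z = normalize([4.0, 10.0], norm_mean, norm_std)
print(norm_mean, z)
```

Shipping the statistics alongside the model keeps the Swift side from drifting out of sync with the values used during training.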
The Result
A high-level AI model was successfully compressed into a tiny 1.3 MB file, 100% numerically validated, and ready to be embedded natively into iOS.

    # fase_4_convert.py - Bypassing SavedModel format bugs

    # Wrapping the Keras model in a Concrete Function to bypass tf.saved_model
    @tf.function(input_signature=[
        tf.TensorSpec(shape=(1, 30, 189), dtype=tf.float32)
    ])
    def inference_fn(landmark_sequence):
        return wrapped_model(landmark_sequence, training=False)

    # Convert directly from memory to .mlpackage (FLOAT32)
    mlmodel = ct.convert(
        model=[inference_fn.get_concrete_function()],
        source="tensorflow",
        compute_precision=ct.precision.FLOAT32,  # Prevent Attention distortion
        minimum_deployment_target=ct.target.iOS17
    )

fase_4_convert.py
Phase 5
Real-Time UI/UX Optimization
During app testing, a fatal issue emerged: Main Thread Starvation. The iOS UI froze (lagged) for 5-10 seconds right after the user completed a correct gesture. Running 3 heavy ML models simultaneously while attempting to render UI triggered an extreme CPU bottleneck.
Three Surgical Strikes:
Dynamic ML Pausing
I implemented a semantic lock (`isTracking = false`) that instantly kills the CoreML processing circuit the exact millisecond a correct gesture is detected. This immediately frees 100% of the Neural Engine/GPU load.
Deferred Disk I/O
Deferred the SwiftData write (`modelContext.save()`) by one second with `DispatchQueue.main.asyncAfter`. The save still runs on the main queue, but only after the victory-screen transition has rendered, so SQLite write locks no longer collide with the SwiftUI render loop.
Lottie Eradication
Conducted deep UI profiling and discovered that initializing confetti animations via Lottie (JSON) forced the Main Thread to synchronously build thousands of `CALayer`s. I eradicated them and replaced them with ultra-lightweight Native SwiftUI animations.

    // AppViewModel.swift - Main Thread Starvation Fix
    func handleCorrectGesture(for word: String) {
        // 1. Instantly kill CoreML pipeline to free 100% CPU/GPU resources
        self.isTracking = false

        // 2. Defer SwiftData Disk I/O to prevent SQLite write-locks
        // from colliding with SwiftUI Victory Screen render loop
        DispatchQueue.main.asyncAfter(deadline: .now() + 1.0) {
            self.saveProgressToSwiftData(word: word)
        }
    }

AppViewModel.swift
The Result
App performance improved dramatically. The 5-10 second freeze is gone: JariLingo now transitions between screens without stalls and touch response feels like a smooth 120 FPS, while the ML pipeline runs during practice and pauses only when a gesture has been recognized.
Reflection
What I Delivered (Impact):
I successfully shipped the JariLingo MVP, a production-grade iOS application. The app delivers an addictive, gamified learning experience (habit-loops, XP, Streaks), processes everything 100% On-Device for absolute privacy, and fully supports inclusive design via `UIAccessibility` integration (VoiceOver verbally reads out camera feedback for the visually impaired in real-time).
What I Gained (Personal Growth):
End-to-End Technical Mastery
I evolved from a standard iOS app developer into a full-stack AI/ML Engineer. I mastered the entire product lifecycle: from Python data engineering, Transformer architecture design, and low-level CoreML compilation, to deep Main Thread profiling in SwiftUI.
Project Management & Product Sense
I learned to ruthlessly prioritize features like a Tech Lead at Apple or Google. I made tough executive decisions (such as retaining the curriculum as local dummy data to ensure demo stability) to guarantee the Core Experience (the AI magic) was delivered flawlessly and without fail.
Empathy-Driven Engineering
Ultimately, the most sophisticated code means nothing if it doesn't solve a human problem. JariLingo is not just about Arrays and Tensors; it is a tangible manifestation of how software engineering can tear down invisible barriers and empower marginalized communities.



Final
Acknowledgements & Dataset Credit
The engineering success of JariLingo's ML pipeline is built upon the shoulders of giants. The entire Machine Learning pipeline was trained using the WL-BISINDO Dataset (Word-Level BISINDO), a revolutionary dataset comprising 1,600 RGB videos across 32 vocabulary classes of the Banten variant of BISINDO.
My deepest appreciation and gratitude go to the lead researchers: Grace Oktaviani Kindy, Glenn Leonali, and Henry Lucky, as well as the deaf sign language contributors Marvel, Tazkia, and Kevita for their dedication to building an inclusive technological ecosystem for the Deaf community in Indonesia.
Citation
If you are interested in exploring the extensive foundation of this dataset, please refer to their official academic publication:
Kindy, G. O., Leonali, G., & Lucky, H. (2025).
Word-Level BISINDO: A Novel Video Indonesian Sign Language Dataset and Baseline Methods. Procedia Computer Science, 269, 249-258. https://doi.org/10.1016/j.procs.2025.08.277