JariLingo • 2026

How I Built a Real-Time AI to Break the Sign Language Barrier

ROLE

AI Mobile Engineer & Tech Lead

TIMELINE

May - Jun 2026

TEAM

Solo Developer

SKILLS

SwiftUI & SwiftData

CoreML

Python & TensorFlow

Overview

A Deep-Dive Technical Case Study on Engineering a Real-Time Indonesian Sign Language (BISINDO) Recognition System.

There are over 2.6 million Deaf individuals in Indonesia whose primary language is Indonesian Sign Language (BISINDO). Unfortunately, the hearing majority lacks accessible ways to learn it. Popular apps like Duolingo have successfully democratized spoken languages, but sign languages lag behind because they require real-time spatial and visual analysis, not just text or audio matching.

My mission was to democratize BISINDO learning by designing an interactive, gamified, and inclusive iOS application. The hard requirements: the AI system must run entirely on-device for privacy, process the camera feed at 30 FPS without perceptible lag, and feel as natural as practicing with a human instructor.

Phase 1

The Video-Based ML Trap & Strategic Pivot

My initial hypothesis was to train Apple's CreateML Action Classifier using dummy video samples I recorded myself.

Testing revealed abysmal accuracy (<40%) whenever lighting, background, or the user's clothing changed. In my tests, the Action Classifier behaved as if it were reacting to overall appearance rather than to the anatomical structure of the hands.

This failure provided a crucial insight: the model should ignore colors and pixels and evaluate skeletal landmarks instead. I also realized I needed large-scale, peer-reviewed academic data, so I integrated the WL-BISINDO Dataset (Kindy et al., 2025), a research repository containing 1,600 high-quality RGB videos across 32 isolated BISINDO signs.

https://www.kaggle.com/datasets/glennleonali/wl-bisindo

Phase 2

Spatial Extraction & Kinematic Feature Engineering

Feeding raw video directly into a model ties it to lighting, background, and clothing, the same factors that broke Phase 1. Instead, I built a Python data pipeline (`extract_landmarks.py`) powered by Google MediaPipe (`hand_landmarker.task` & `pose_landmarker_lite.task`).

# extract_landmarks.py - The Pivot to Skeleton-Based ML
import mediapipe as mp

# 1. Initialize Base Options for Hand & Pose Models
BaseOptions = mp.tasks.BaseOptions
VisionRunningMode = mp.tasks.vision.RunningMode

# 2. Configure Hand Landmarker for Video Processing
hand_options = mp.tasks.vision.HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='hand_landmarker.task'),
    running_mode=VisionRunningMode.VIDEO,
    num_hands=1,
    min_hand_detection_confidence=0.5
)

# 3. Create instances to extract mathematical skeletons, ignoring pixels
with mp.tasks.vision.HandLandmarker.create_from_options(hand_options) as landmarker:
    # Process the 1,600 WL-BISINDO videos frame by frame...
    # (each detected hand yields 21 landmarks with x, y, z coordinates)
    pass

extract_landmarks.py

I extracted 21 hand joints (63 spatial X,Y,Z coordinates) per frame, transforming 1,600 videos into a foundational CSV dataset of nearly 100 MB.
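The flattening step described above can be sketched as follows. This is a minimal illustration, not the project's actual `extract_landmarks.py`: the `Landmark` tuple and the `flatten_hand` helper are hypothetical names, and filling a frame with zeros when no hand is detected is an assumption, since the source does not say how missed detections are handled.

```python
from typing import List, NamedTuple, Optional, Sequence

NUM_JOINTS = 21        # hand joints per frame, as stated in the text
COORDS_PER_JOINT = 3   # x, y, z


class Landmark(NamedTuple):
    """Hypothetical stand-in for one detected hand landmark."""
    x: float
    y: float
    z: float


def flatten_hand(landmarks: Optional[Sequence[Landmark]]) -> List[float]:
    """Turn 21 (x, y, z) landmarks into one flat row of 63 floats.

    Assumption: a frame with no detected hand becomes a row of zeros.
    """
    if not landmarks:
        return [0.0] * (NUM_JOINTS * COORDS_PER_JOINT)
    if len(landmarks) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} landmarks, got {len(landmarks)}")
    row: List[float] = []
    for lm in landmarks:
        row.extend((lm.x, lm.y, lm.z))
    return row
```

Each row produced this way corresponds to one line of the CSV: one frame, 63 numbers.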

Sign language is not merely static poses; it is fundamentally about rhythm and velocity. In the `process_kinematics.py` script, I applied three feature engineering steps:

Kalman Filter (63-dimensional)

Smoothed out frame-to-frame jitter in the raw landmark coordinates.

Derived Kinematics

Calculated Velocity (63 features) and Acceleration (63 features) between consecutive frames.

Temporal Windowing

Standardized the data sequences into a fixed time window of exactly 30 frames.
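To make the denoising idea concrete on a single coordinate, here is a minimal scalar Kalman filter sketch. It is a forward-only filter with a random-walk state model, not the 63-dimensional smoother used in the project, and the noise variances `process_var` and `measurement_var` are assumed values chosen purely for illustration.

```python
from typing import List, Sequence


def kalman_filter_1d(
    measurements: Sequence[float],
    process_var: float = 1e-4,      # assumed: how much the true value drifts per frame
    measurement_var: float = 1e-2,  # assumed: variance of the landmark jitter
) -> List[float]:
    """Forward Kalman filter for one coordinate with a random-walk model."""
    if not measurements:
        return []
    estimate = measurements[0]  # start from the first observation
    error_var = 1.0             # initial uncertainty of the estimate
    filtered = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, so only uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        filtered.append(estimate)
    return filtered
```

A jittery input such as `[0.5, 0.6, 0.4, 0.6, 0.4]` comes out with a visibly smaller swing, which is the effect the 63-dimensional version has on every landmark coordinate at once.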

# process_kinematics.py - The Core Logic
import numpy as np
from pykalman import KalmanFilter

# `coords` holds one video's raw landmarks with shape (num_frames, 63)

# 1. Denoising raw coordinates using 63-dimensional Kalman Filter
kf = KalmanFilter(initial_state_mean=coords[0], n_dim_obs=63)
smoothed_coords, _ = kf.smooth(coords)

# 2. Deriving Kinematics (Velocity & Acceleration)
# First-order differences between consecutive frames; frame 0 stays at zero
velocity = np.zeros_like(smoothed_coords)
velocity[1:] = smoothed_coords[1:] - smoothed_coords[:-1]

acceleration = np.zeros_like(velocity)
acceleration[1:] = velocity[1:] - velocity[:-1]

# 3. Concatenating into a 189-dimensional Feature Tensor (63 + 63 + 63)
full_features = np.concatenate([smoothed_coords, velocity, acceleration], axis=1)

process_kinematics.py

This process converted the raw video frames into tensors of shape `(N, 30, 189)` (`X_data_fase_2.npy`): N samples, 30 frames each, and 189 features per frame (63 coordinates plus 63 velocities and 63 accelerations). The model now sees only hand motion, which greatly improves the signal-to-noise ratio of its input.
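The temporal windowing step is not shown in the listing, so here is one way it could work. This is a sketch under assumptions: the helper name `to_fixed_window` is hypothetical, and both strategies (evenly spaced sampling for long clips, repeating the last frame for short clips) are illustrative choices, since the source only states that every sequence ends up with exactly 30 frames.

```python
from typing import List, Sequence

WINDOW_SIZE = 30  # frames per sample, as stated in the text


def to_fixed_window(
    frames: Sequence[Sequence[float]], window_size: int = WINDOW_SIZE
) -> List[List[float]]:
    """Resample or pad a variable-length clip to exactly window_size frames."""
    if not frames:
        raise ValueError("cannot window an empty sequence")
    n = len(frames)
    if n >= window_size:
        # Assumed strategy: pick evenly spaced frame indices across the clip.
        step = (n - 1) / (window_size - 1) if window_size > 1 else 0.0
        indices = [round(i * step) for i in range(window_size)]
        return [list(frames[i]) for i in indices]
    # Assumed strategy: hold the final pose by repeating the last frame.
    padding = [list(frames[-1]) for _ in range(window_size - n)]
    return [list(f) for f in frames] + padding
```

Stacking the windowed clips from all videos along a new first axis gives the `(N, 30, 189)` layout described above.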

Phase 3

Training the "Brain" (Transitioning from LSTM to Transformer)

Recurrent networks such as LSTMs (Long Short-Term Memory) struggled to capture the long-range temporal dependencies of complex BISINDO signs, where the beginning and end of a gesture heavily dictate its meaning.

I custom-designed and trained a Spatial-Temporal Transformer architecture from scratch using TensorFlow/Keras (`fase_3_train.py`).

I implemented Sinusoidal Positional Encoding (following the "Attention Is All You Need" paper) to inject temporal awareness into the model.

I stacked 4 Multi-Head Attention blocks (`NUM_HEADS=4`, `D_MODEL=128`), ending with `GlobalAveragePooling1D`.
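The core operation inside each attention block can be sketched in plain Python. This shows single-head scaled dot-product attention only; it is not the project's `TransformerBlock`, and it omits the learned projections, the multiple heads, residual connections, and layer normalization that a full block would typically add.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V for one head, without masking."""
    d_k = len(k[0])
    width = len(v[0])
    output: Matrix = []
    for q_row in q:
        # Similarity of this frame's query to every frame's key.
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append(
            [sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(width)]
        )
    return output
```

Because every frame's query is compared with every other frame's key, frame 1 can attend to frame 30 directly, which is the property that helps with signs whose start and end both matter.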

# fase_3_train.py - Custom Spatial-Temporal Transformer
from tensorflow import keras
from tensorflow.keras import layers

# positional_encoding() and TransformerBlock are defined elsewhere in the script.

def build_bisindo_transformer(window_size=30, num_features=189, d_model=128):
    inputs = keras.Input(shape=(window_size, num_features))
    
    # 1. Feature Projection & Temporal Awareness
    x = layers.Dense(d_model)(inputs)
    x = x + positional_encoding(window_size, d_model)
    
    # 2. Stacking 4 Multi-Head Attention Blocks
    for _ in range(4):
        x = TransformerBlock(d_model, num_heads=4, ff_dim=256)(x)
        
    # 3. Pool over the 30 frames, then classify into the 32 signs
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(32, activation="softmax")(x)
    
    return keras.Model(inputs, outputs)

fase_3_train.py
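The `positional_encoding` helper called in the listing is not shown in the source. A minimal plain-Python sketch of the sinusoidal scheme from "Attention Is All You Need" follows; the function name `sinusoidal_positional_encoding` and the list-of-lists return type are assumptions for illustration, and the project's version presumably returns a tensor instead.

```python
import math
from typing import List


def sinusoidal_positional_encoding(
    window_size: int = 30, d_model: int = 128
) -> List[List[float]]:
    """Build a (window_size, d_model) table of sinusoidal position codes.

    Even dimensions use sin(pos / 10000^(2i / d_model));
    odd dimensions use cos of the same angle.
    """
    table: List[List[float]] = []
    for pos in range(window_size):
        row: List[float] = []
        for dim in range(d_model):
            pair_index = dim // 2  # dims 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * pair_index / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```

Each of the 30 frames receives a distinct 128-dimensional code, so after the addition the attention layers can tell early frames from late ones.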

The Result

Using Cosine Decay Restarts to schedule the learning rate, this Transformer architecture reached a validation accuracy above 90% across 32 vocabulary classes at epoch 100, a clear improvement over the LSTM approach on this task.
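For readers unfamiliar with the schedule, the idea behind Cosine Decay Restarts can be sketched in plain Python. Keras ships this as `tf.keras.optimizers.schedules.CosineDecayRestarts`; the function below is an independent re-implementation of the same idea for illustration, and the hyperparameter values in the usage note are assumptions, since the source does not state the ones used in training.

```python
import math


def cosine_decay_restarts(
    step: int,
    initial_lr: float,
    first_decay_steps: int,
    t_mul: float = 2.0,   # each cycle is t_mul times longer than the last
    m_mul: float = 1.0,   # each restart peaks at m_mul times the previous peak
    alpha: float = 0.0,   # floor, as a fraction of initial_lr
) -> float:
    """Learning rate at `step` under cosine decay with warm restarts."""
    cycle_start = 0.0
    cycle_len = float(first_decay_steps)
    peak = 1.0
    # Walk forward through completed cycles to find the current one.
    while step >= cycle_start + cycle_len:
        cycle_start += cycle_len
        cycle_len *= t_mul
        peak *= m_mul
    fraction = (step - cycle_start) / cycle_len  # progress within the cycle
    cosine = 0.5 * peak * (1.0 + math.cos(math.pi * fraction))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)
```

With an assumed `initial_lr=1e-3` and `first_decay_steps=10`, the rate falls along a cosine curve over steps 0 to 9, jumps back to the peak at step 10, and then decays again over a cycle twice as long.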

Phase 4

Hacking CoreML Limitations

Core ML's ML Program models are packaged in the `.mlpackage` format. However, converting the Keras Transformer model with the `coremltools` library (under TensorFlow 2.17) resulted in fatal crashes due to structural incompatibilities with the `SavedModel` format (the `_DictWrapper` bug).

To work around this, I wrote a conversion script (`fase_4_convert.py`) that skips the standard file format altogether. I wrapped the Transformer in a `@tf.function`, obtained its Concrete Function, and passed that in-memory execution graph directly to the converter, with no file system serialization in between.

Furthermore, I kept the compute precision at FLOAT32 (my concern was that Float16 would distort the Attention weights) and stored the normalization statistics (`norm_mean`, `norm_std`) directly in the `.mlpackage` metadata.
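One way to carry normalization statistics alongside a model is to serialize them as strings in its metadata and apply them before inference. The sketch below assumes `norm_mean` and `norm_std` are per-feature lists stored as JSON strings; the actual storage format inside the `.mlpackage` is not described in the source, and the helper names `encode_stats` and `normalize_frame`, as well as the small `eps` guard, are hypothetical.

```python
import json
from typing import Dict, List, Sequence


def encode_stats(mean: Sequence[float], std: Sequence[float]) -> Dict[str, str]:
    """Serialize per-feature statistics as string metadata (assumed format)."""
    return {
        "norm_mean": json.dumps(list(mean)),
        "norm_std": json.dumps(list(std)),
    }


def normalize_frame(
    frame: Sequence[float], metadata: Dict[str, str], eps: float = 1e-8
) -> List[float]:
    """Apply z-score normalization using statistics read back from metadata."""
    mean = json.loads(metadata["norm_mean"])
    std = json.loads(metadata["norm_std"])
    if not (len(frame) == len(mean) == len(std)):
        raise ValueError("feature count does not match stored statistics")
    # eps guards against division by zero for constant features.
    return [(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]
```

Keeping the statistics with the model means the app normalizes live input with the same values used during training, rather than relying on a separately maintained copy.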

The Result

The converted model is a compact 1.3 MB file, numerically validated and ready to be embedded natively into iOS.

# fase_4_convert.py - Bypassing SavedModel format bugs
import coremltools as ct
import tensorflow as tf

# `wrapped_model` is the trained Keras Transformer, defined earlier in the script.
# Wrapping the Keras model in a Concrete Function to bypass tf.saved_model
@tf.function(input_signature=[
    tf.TensorSpec(shape=(1, 30, 189), dtype=tf.float32)
])
def inference_fn(landmark_sequence):
    return wrapped_model(landmark_sequence, training=False)

# Convert directly from memory to .mlpackage (FLOAT32)
mlmodel = ct.convert(
    model=[inference_fn.get_concrete_function()],
    source="tensorflow",
    compute_precision=ct.precision.FLOAT32, # Prevent Attention distortion
    minimum_deployment_target=ct.target.iOS17
)

fase_4_convert.py

Phase 5

Real-Time UI/UX Optimization

During app testing, a serious issue emerged: Main Thread Starvation. The iOS UI froze for 5-10 seconds right after the user completed a correct gesture. Running 3 heavy ML models simultaneously while rendering the UI created an extreme CPU bottleneck.

Three Surgical Strikes:

Dynamic ML Pausing

I implemented a flag (`isTracking = false`) that stops the CoreML processing pipeline as soon as a correct gesture is detected, freeing the Neural Engine/GPU for the UI transition.

Deferred Disk I/O

Deferred the SwiftData write (`modelContext.save()`) by one second using `DispatchQueue.main.asyncAfter`. The save still runs on the main queue, but no longer at the same moment the victory screen starts rendering, so SQLite locks stop colliding with the SwiftUI render loop.

Lottie Eradication

Conducted deep UI profiling and discovered that initializing confetti animations via Lottie (JSON) forced the Main Thread to synchronously build thousands of `CALayer`s. I eradicated them and replaced them with ultra-lightweight Native SwiftUI animations.

// AppViewModel.swift - Main Thread Starvation Fix
func handleCorrectGesture(for word: String) {
    // 1. Stop the CoreML pipeline immediately to free CPU/GPU resources
    self.isTracking = false 
    
    // 2. Delay the SwiftData disk write by 1 second (still on the main queue)
    // so SQLite write-locks do not collide with the Victory Screen render loop
    DispatchQueue.main.asyncAfter(deadline: .now() + 1.0) {
        self.saveProgressToSwiftData(word: word)
    }
}

AppViewModel.swift

The Result

The post-gesture freeze is gone. JariLingo now moves through UI transitions smoothly, with touch response that feels close to 120 FPS, while the ML pipeline keeps running in the background.

Reflection

What I Delivered (Impact):

I successfully shipped the JariLingo MVP, a production-grade iOS application. The app delivers an addictive, gamified learning experience (habit-loops, XP, Streaks), processes everything 100% On-Device for absolute privacy, and fully supports inclusive design via `UIAccessibility` integration (VoiceOver verbally reads out camera feedback for the visually impaired in real-time).

What I Gained (Personal Growth):

End-to-End Technical Mastery

I evolved from a standard iOS app developer into a full-stack AI/ML Engineer. I mastered the entire product lifecycle: from Python data engineering, Transformer architecture design, and low-level CoreML compilation, to deep Main Thread profiling in SwiftUI.

Project Management & Product Sense

I learned to ruthlessly prioritize features like a Tech Lead at Apple or Google. I made tough executive decisions (such as retaining the curriculum as local dummy data to ensure demo stability) to guarantee the Core Experience (the AI magic) was delivered flawlessly and without fail.

Empathy-Driven Engineering

Ultimately, the most sophisticated code means nothing if it doesn't solve a human problem. JariLingo is not just about Arrays and Tensors; it is a tangible manifestation of how software engineering can tear down invisible barriers and empower marginalized communities.

Final

Acknowledgements & Dataset Credit

The engineering success of JariLingo's ML pipeline is built upon the shoulders of giants. The entire Machine Learning pipeline was trained using the WL-BISINDO Dataset (Word-Level BISINDO), a revolutionary dataset comprising 1,600 RGB videos across 32 vocabulary classes of the Banten variant of BISINDO.

My deepest appreciation and gratitude go to the lead researchers: Grace Oktaviani Kindy, Glenn Leonali, and Henry Lucky, as well as the Deaf sign language contributors Marvel, Tazkia, and Kevita for their dedication to building an inclusive technological ecosystem for the Deaf community in Indonesia.

Citation

If you are interested in exploring the extensive foundation of this dataset, please refer to their official academic publication:

Kindy, G. O., Leonali, G., & Lucky, H. (2025). Word-Level BISINDO: A Novel Video Indonesian Sign Language Dataset and Baseline Methods. Procedia Computer Science, 269, 249-258. https://doi.org/10.1016/j.procs.2025.08.277
