Pragmatic Paths to On-Device AI on Android with ML Kit

Use ML Kit to add on-device AI (Text/Barcode/OCR, Object & Pose, Translation) with simple Kotlin APIs: fast, offline, private.

Mohan Sankaran

Jan. 12, 26 · Tutorial

There isn’t a single canonical way to add on-device AI to Android apps. Your ideal path depends on latency, privacy, UX, and maintainability. Google’s ML Kit gives you interchangeable building blocks — text recognition, barcode scanning, object/pose detection, translation, and more — that you can compose to fit your constraints. This guide lays out a pragmatic architecture, drop-in code, and a performance checklist you can ship in a sprint. The theme is intentional minimalism: pick one capability, wrap it behind a tiny interface, wire it to CameraX if needed, and iterate with metrics instead of speculative complexity.

When ML Kit Is the Smart Choice

On-device by default: You get low latency, offline reliability, and strong privacy because images and text don’t need to leave the device for common tasks. This can reduce legal/compliance exposure and removes the network tail latency that frustrates users during capture flows.
Production-hardened models: The bundled models handle rotation, noise, motion blur, and imperfect lighting better than most “roll-your-own” attempts. You benefit from years of tuning without owning a training pipeline.
Modular adoption: Add exactly one capability at a time; you don’t need a model server, autoscaling, or a feature-flagged rollout of custom models. That simplicity keeps your blast radius small.
Great Android ergonomics: ML Kit works cleanly with CameraX, coroutines, and lifecycle components. That means less boilerplate and fewer foot-guns when you integrate with the camera stack, orientation changes, or backgrounding/foregrounding transitions.

Common wins:

Text Recognition for receipts, forms, and serials
Barcode Scanning for QR/retail codes, tickets, and boarding passes
Object Detection & Tracking for AR-lite highlights and tap-to-focus interactions
Pose Detection / Selfie Segmentation for fitness and background effects
Language ID + Translation for chat and travel scenarios

Project Setup (minimal friction)

app/build.gradle:

Groovy

dependencies {
    // Choose only what you need.
    // "latest-version" is a placeholder: pin the current release of each artifact.
    implementation "com.google.mlkit:text-recognition:latest-version"
    implementation "com.google.mlkit:barcode-scanning:latest-version"
    implementation "com.google.mlkit:object-detection:latest-version"

    // CameraX
    def camerax = "1.3.4"
    implementation "androidx.camera:camera-core:$camerax"
    implementation "androidx.camera:camera-camera2:$camerax"
    implementation "androidx.camera:camera-lifecycle:$camerax"
    implementation "androidx.camera:camera-view:$camerax"

    // Coroutines interop with Google Tasks
    implementation "org.jetbrains.kotlinx:kotlinx-coroutines-play-services:1.7.3"
}

Versioning tip: Use Gradle version catalogs and bump dependencies on a release train, not ad hoc.

Pattern 1: Still-Image Text Recognition (clean, testable)

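Kotlin

import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions
import kotlinx.coroutines.tasks.await

// Thin wrapper so the rest of the app never touches ML Kit types directly.
class TextReader : AutoCloseable {
    private val client = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)

    // rotationDegrees must be 0, 90, 180, or 270; pass the source's rotation
    // (e.g., from EXIF) so ML Kit sees an upright image.
    suspend fun read(bitmap: Bitmap, rotationDegrees: Int = 0): String {
        val image = InputImage.fromBitmap(bitmap, rotationDegrees)
        val result = client.process(image).await() // suspends instead of blocking a thread
        return result.text.trim()
    }

    // Release the recognizer's resources when the owner is done with it.
    override fun close() = client.close()
}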

Why this scales: Keep ML Kit behind a tiny API you can fake in tests. Normalize rotation at the boundary and return domain objects (e.g., ReceiptFields) rather than raw strings.
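As a minimal sketch of that boundary, assuming a receipt use case: the `ReceiptFields` type, the `ReceiptScanner` port, the fake, and the two regexes below are illustrative names and formats, not part of ML Kit. The parser is a pure function, so it can be unit-tested with plain strings.

```kotlin
// Hypothetical domain type; the fields are illustrative.
data class ReceiptFields(val total: String?, val date: String?)

// Port the UI depends on. A production implementation would call
// TextReader.read(...) and hand the result to parseReceipt.
interface ReceiptScanner {
    suspend fun scan(): ReceiptFields
}

// Assumed formats: "TOTAL 12.50"-style amounts and ISO dates (yyyy-MM-dd).
val totalRegex = Regex("""(?i)\btotal\b\D*(\d+[.,]\d{2})""")
val dateRegex = Regex("""\b(\d{4}-\d{2}-\d{2})\b""")

// Pure function: raw OCR text in, domain object out.
fun parseReceipt(rawText: String): ReceiptFields = ReceiptFields(
    total = totalRegex.find(rawText)?.groupValues?.get(1),
    date = dateRegex.find(rawText)?.groupValues?.get(1),
)

// Test double: returns canned OCR text without touching ML Kit or a camera.
class FakeReceiptScanner(private val cannedText: String) : ReceiptScanner {
    override suspend fun scan(): ReceiptFields = parseReceipt(cannedText)
}
```

Because callers only see `ReceiptScanner`, swapping the fake for an ML Kit-backed implementation should not ripple into view models or tests.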

Pattern 2: Real-Time CameraX -> Analyzer (live capture)

Kotlin

import androidx.annotation.OptIn
import androidx.camera.core.ExperimentalGetImage
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.ImageProxy
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

class LiveTextAnalyzer(
    private val onText: (String) -> Unit,
    private val onError: (Exception) -> Unit = {}
) : ImageAnalysis.Analyzer {

    private val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)

    // ImageProxy.image is marked experimental in CameraX, hence the opt-in.
    @OptIn(ExperimentalGetImage::class)
    override fun analyze(imageProxy: ImageProxy) {
        val mediaImage = imageProxy.image ?: return imageProxy.close()
        val rotation = imageProxy.imageInfo.rotationDegrees
        val image = InputImage.fromMediaImage(mediaImage, rotation)

        recognizer.process(image)
            .addOnSuccessListener { onText(it.text) }
            .addOnFailureListener { onError(it) }
            .addOnCompleteListener { imageProxy.close() } // always close, or no new frames arrive
    }

    // Call when the screen goes away to free the recognizer.
    fun close() = recognizer.close()
}

UX polish that users feel: A framing hint (“Align the code inside the box”), a subtle haptic on success, and throttled overlay updates (e.g., 150–250 ms) to avoid flicker.
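The throttling idea can be sketched in a few lines of plain Kotlin. `Throttler` is a hypothetical helper, not a CameraX or ML Kit class; it assumes calls arrive on a single thread (for example, the main thread where overlay updates happen), and it takes an injectable clock so the behavior is testable.

```kotlin
// Hypothetical helper: runs an action at most once per minIntervalMs.
// Not thread-safe; assumes all calls come from one thread.
class Throttler(
    private val minIntervalMs: Long = 200,
    private val nowMs: () -> Long = { System.nanoTime() / 1_000_000 },
) {
    private var lastRunMs: Long? = null

    // Returns true if the action ran, false if it was dropped as too soon.
    fun tryRun(action: () -> Unit): Boolean {
        val now = nowMs()
        val last = lastRunMs
        if (last != null && now - last < minIntervalMs) return false
        lastRunMs = now
        action()
        return true
    }
}
```

In an analyzer callback, this might wrap the overlay update, e.g. `throttler.tryRun { overlay.show(text) }`, where `overlay` is whatever view the app uses. Dropped updates are simply skipped, which suits live text because a newer result arrives shortly after.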

Pattern 3: Object Detection & Tracking (multi-object, optional labels)

Kotlin

import com.google.mlkit.vision.objects.ObjectDetection
import com.google.mlkit.vision.objects.defaults.ObjectDetectorOptions

val detector = ObjectDetection.getClient(
    ObjectDetectorOptions.Builder()
        .setDetectorMode(ObjectDetectorOptions.STREAM_MODE)
        .enableMultipleObjects()
        .enableClassification() // coarse labels like "Food", "Home good"
        .build()
)

Draw rounded rects keyed by each object’s tracking ID so users can see continuity across frames. Maintain a simple tracker map to manage per-object UI state.
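One possible shape for that tracker map, as a sketch: `TrackerMap` and `TrackedUiState` are hypothetical names, and the IDs are assumed to be the non-null tracking IDs taken from each detected object in a frame. Entries that go unseen for more than a few frames are evicted so the map does not grow without bound.

```kotlin
// Hypothetical per-object UI state; extend with whatever the overlay needs.
data class TrackedUiState(
    val firstSeenFrame: Long,
    val lastSeenFrame: Long,
    val selected: Boolean = false,
)

class TrackerMap(private val maxMissedFrames: Int = 5) {
    private val states = mutableMapOf<Int, TrackedUiState>()
    private var frame = 0L

    // Call once per analyzed frame with the tracking IDs detected in it.
    fun onFrame(trackingIds: Collection<Int>): Map<Int, TrackedUiState> {
        frame++
        for (id in trackingIds) {
            val previous = states[id]
            states[id] = previous?.copy(lastSeenFrame = frame)
                ?: TrackedUiState(firstSeenFrame = frame, lastSeenFrame = frame)
        }
        // Evict objects that have been missing for too many frames.
        states.entries.removeAll { frame - it.value.lastSeenFrame > maxMissedFrames }
        return states.toMap()
    }

    // Example of UI state that survives across frames: a tap toggles selection.
    fun toggleSelected(id: Int) {
        states[id]?.let { states[id] = it.copy(selected = !it.selected) }
    }
}
```

Keeping a short grace period (`maxMissedFrames`) avoids flicker when the detector briefly loses an object and then picks it up again under the same ID.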

Security, Privacy, and Accessibility (professional baseline)

Privacy UX: Place “Processed on your device; nothing uploaded” near the capture action (not buried in settings).
Permission education: Explain why you need camera access before the system dialog.
A11y: Announce detections via TalkBack, provide a manual capture button, respect reduced-motion, and avoid focus thrash.
Failure design: Time out gracefully, show a retry affordance, and debounce repeated attempts.

Testing & Observability (so it doesn’t regress)

Interfaces > implementations: Hide ML Kit behind Repository/UseCase ports and use fakes in unit tests.
Golden inputs: Keep a tiny suite of canonical images (good/low light, rotated, blurred). Assert on parsed fields, not raw strings.
Cold-start metrics: Track detector init, time-to-first-result, and analyzer throughput (p50/p95).
Sampled logs: Log consecutive failures and recovery; keep SLOs honest.
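For the p50/p95 numbers, a small in-memory recorder is often enough to start. The sketch below uses the nearest-rank percentile method; `LatencyStats` is a hypothetical helper, and a production app would more likely forward samples to whatever metrics pipeline it already has.

```kotlin
// Hypothetical helper: collects latency samples and reports nearest-rank percentiles.
class LatencyStats {
    private val samplesMs = mutableListOf<Long>()

    fun record(ms: Long) {
        samplesMs += ms
    }

    // Nearest-rank percentile for p in 1..100; null when no samples exist.
    fun percentile(p: Int): Long? {
        require(p in 1..100) { "p must be in 1..100" }
        if (samplesMs.isEmpty()) return null
        val sorted = samplesMs.sorted()
        val rank = (p * sorted.size + 99) / 100 // integer ceil(p/100 * n)
        return sorted[rank - 1]
    }
}
```

Recording one sample per analyzed frame (time from `analyze` entry to the success callback) gives analyzer latency; a separate instance can track time-to-first-result.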

Performance Checklist (drop into your PR)

Set ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST as the backpressure strategy to prevent frame backlogs.
Start/stop analyzers with lifecycle; no invisible background work.
Reuse recognizers/detectors; close them when screens disappear.
Downscale frames when 4K isn’t necessary for the task.
Throttle overlay updates; no heavy work on the main thread.
Back off on repeated failures (exponential or capped linear).
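The back-off item is the easiest one to get subtly wrong, so here is a minimal sketch of the exponential variant with a cap. The function name and the default values (250 ms base, 8 s cap) are assumptions chosen for illustration.

```kotlin
// Delay before the next attempt: baseMs * 2^(failures - 1), capped at capMs.
fun backoffDelayMs(
    consecutiveFailures: Int,
    baseMs: Long = 250,
    capMs: Long = 8_000,
): Long {
    if (consecutiveFailures <= 0) return 0L
    // Bound the shift so the default base cannot overflow a Long.
    val shift = (consecutiveFailures - 1).coerceAtMost(20)
    return (baseMs shl shift).coerceAtMost(capMs)
}
```

With the defaults this yields 250, 500, 1000, 2000, 4000, and then 8000 ms for every further failure. Resetting the failure count on the first success keeps recovery fast.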

Hybrid Approaches (when you need domain specificity)

There’s no rule that everything must be on-device. A pragmatic flow:

Use ML Kit to quickly localize candidates on-device.
With explicit consent, send cropped regions to a server model for high-recall verification.
Cache results and translation packs so the user experience degrades gracefully offline.
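The flow above can be sketched as a small resolver. Everything here is hypothetical: `RemoteVerifier` stands in for the server model, `cropKey` is assumed to be a stable key for the cropped region (for example, a content hash), and the call is synchronous only to keep the sketch short; real code would run it off the main thread.

```kotlin
// Hypothetical port for the server-side model; not an ML Kit API.
fun interface RemoteVerifier {
    fun verify(cropKey: String): String?
}

class HybridResolver(
    private val remote: RemoteVerifier,
    private val hasConsent: () -> Boolean,
) {
    private val cache = mutableMapOf<String, String>()

    // Prefers a cached or server-verified value; falls back to the on-device guess.
    fun resolve(cropKey: String, onDeviceGuess: String?): String? {
        cache[cropKey]?.let { return it }
        if (!hasConsent()) return onDeviceGuess // never upload without consent
        val verified = try {
            remote.verify(cropKey)
        } catch (e: Exception) {
            null // offline or server error: degrade to the on-device result
        }
        if (verified != null) cache[cropKey] = verified
        return verified ?: onDeviceGuess
    }
}
```

The on-device result is always available as a fallback, so losing the network or declining consent degrades accuracy rather than breaking the feature.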

Takeaway

There are multiple valid paths to ship intelligent camera and language features on Android. ML Kit’s modular APIs let you choose the composition that fits your latency, privacy, and UX goals, without the drag of model hosting.

Start with one capability (text or barcodes), wrap it behind a clean use-case interface, wire up CameraX, and iterate with the checklist above. You’ll deliver meaningful AI in a single release cycle — safe, measurable, and maintainable.

Opinions expressed by DZone contributors are their own.
