
Harnessing ML Kit for Fast, Accurate On-Device Text Recognition

July 11, 2025
By Sheharyar


Introduction

Imagine your mobile app instantly scanning a business card, extracting the name, phone, and email, and auto-populating a contact form—without ever touching a server. On-device text recognition with Google’s ML Kit makes this possible, delivering high accuracy and lightning-fast responses even when offline. By leveraging an optimized TensorFlow Lite model under the hood, ML Kit offers out-of-the-box APIs for both Android and iOS that handle varied lighting, orientations, and fonts. In this guide, we’ll explore how to integrate ML Kit’s Text Recognition API, optimize performance, handle multilingual text, and structure your code for maintainability. Whether you’re building a receipt scanner, document archiver, or real-time translation tool, you’ll have everything you need to get started in minutes.

1. Why On-Device Text Recognition Matters

Instantaneous feedback: No network latency means results in milliseconds.
Privacy by design: User images never leave the device—ideal for sensitive data like IDs and prescriptions.
Offline functionality: Your app continues to work in airplane mode or low-connectivity areas.
Cost savings: Eliminates server-side OCR fees and data transfer charges.

Real-World Analogy: Think of on-device recognition like having a mini-photocopier and librarian in the user’s pocket—efficient, private, and always available.

2. ML Kit Text Recognition Overview

2.1 Two Flavors of the API

On-Device Text Recognition
- Supports Latin-script languages by default (English, French, German, etc.).
- No network call, minimal setup.
Cloud Text Recognition
- Broader language support and handwriting recognition.
- Requires Firebase project and internet connection.

For speed, cost-efficiency, and simplicity, we’ll focus on the on-device variant.

2.2 Key Features

Text blocks, lines, and elements: Hierarchical structure for fine-grained parsing.
Bounding boxes & corner points: Precise coordinates for overlaying UI.
Confidence scores: Assess reliability of each recognized element.
Script selection: Improve accuracy by choosing the recognizer that matches the expected script.

3. Setting Up ML Kit in Your Project

Android (Kotlin) Integration

Add the dependency to your app/build.gradle:
implementation 'com.google.mlkit:text-recognition:16.0.0'
Ensure you’re using AndroidX and a minSdkVersion of 21 or higher.
Declare the camera permission in your AndroidManifest.xml, plus storage access only if you also read saved images. CAMERA is a dangerous permission, so request it at runtime as well:
<uses-permission android:name="android.permission.CAMERA"/>
<uses-permission android:name="android.permission.READ_EXTERNAL_STORAGE"/>

iOS (Swift) Integration

Add MLKitTextRecognition via CocoaPods in your Podfile:
pod 'GoogleMLKit/TextRecognition', '3.1.0'
Run pod install and open your .xcworkspace.
Enable camera usage in Info.plist:
<key>NSCameraUsageDescription</key>
<string>Used for scanning documents and receipts</string>

4. Capturing Images for OCR

Best Practices for Image Quality

Stabilize the camera: Encourage users to brace their device or use a stand.
Good lighting: Avoid harsh backlight and glare.
Avoid skew: Hold the device parallel to the document so the page appears rectangular in the frame.

Sample CameraX (Android) Implementation

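import androidx.camera.core.CameraSelector
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.Preview
import androidx.camera.lifecycle.ProcessCameraProvider
import androidx.core.content.ContextCompat

val cameraProviderFuture = ProcessCameraProvider.getInstance(context)
cameraProviderFuture.addListener({
  val cameraProvider = cameraProviderFuture.get()
  val preview = Preview.Builder().build().also {
    it.setSurfaceProvider(previewView.surfaceProvider)
  }
  val imageAnalysis = ImageAnalysis.Builder()
    // Drop stale frames so the analyzer always receives the most recent one.
    .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
    .build()
    .also {
      // processImage is responsible for closing each imageProxy it receives.
      it.setAnalyzer(executor) { imageProxy -> processImage(imageProxy) }
    }
  cameraProvider.bindToLifecycle(this, CameraSelector.DEFAULT_BACK_CAMERA, preview, imageAnalysis)
}, ContextCompat.getMainExecutor(context))

5. Performing On-Device Text Recognition

Core Recognition Logic

Android (Kotlin)

import android.util.Log
import androidx.camera.core.ImageProxy
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

// Create the recognizer once; building a new client for every frame reloads the model.
private val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)

private fun processImage(imageProxy: ImageProxy) {
  // imageProxy.image is marked @ExperimentalGetImage; lint may ask for an opt-in annotation here.
  val mediaImage = imageProxy.image
  if (mediaImage == null) {
    imageProxy.close()
    return
  }
  val inputImage = InputImage.fromMediaImage(mediaImage, imageProxy.imageInfo.rotationDegrees)
  recognizer.process(inputImage)
    .addOnSuccessListener { visionText -> extractText(visionText) }
    .addOnFailureListener { e -> Log.e(TAG, "Text recognition failed", e) }
    // Always release the frame; CameraX delivers no further frames until it is closed.
    .addOnCompleteListener { imageProxy.close() }
}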

iOS (Swift)

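import MLKitTextRecognition
import MLKitVision
import UIKit

func recognizeText(in image: UIImage) {
  let visionImage = VisionImage(image: image)
  // Pass the orientation along so rotated photos are read correctly.
  visionImage.orientation = image.imageOrientation
  let options = TextRecognizerOptions()
  let textRecognizer = TextRecognizer.textRecognizer(options: options)
  textRecognizer.process(visionImage) { result, error in
    guard error == nil, let result = result else {
      // Do not force-unwrap: error can be nil when result is nil.
      print("Text recognition error: \(error?.localizedDescription ?? "no result")")
      return
    }
    self.extractText(result)
  }
}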

Parsing the Results

Both platforms return a hierarchical data structure:

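// Walks the hierarchy: blocks contain lines, lines contain elements (roughly, words).
fun extractText(result: Text) {
  for (block in result.textBlocks) {
    for (line in block.lines) {
      for (element in line.elements) {
        // boundingBox is nullable and expressed in input-image coordinates.
        Log.d(TAG, "Text: ${element.text}, Bounds: ${element.boundingBox}")
      }
    }
  }
}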

Use block.boundingBox and element.cornerPoints on Android (frame and cornerPoints on iOS) to draw overlays on a Canvas or CALayer.
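Bounding boxes come back in the coordinate space of the input image, so they usually need scaling before they line up with the preview. The sketch below shows one way to do that mapping, assuming the preview fills the view and is center-cropped, and that the image width and height already account for rotation. BoxF and mapToView are hypothetical names used for illustration, not ML Kit or CameraX types, and mirroring for the front camera is not handled.

```kotlin
// Hypothetical float rectangle so the sketch runs without Android classes.
data class BoxF(val left: Float, val top: Float, val right: Float, val bottom: Float)

// Maps a box from image coordinates to view coordinates, assuming the image is
// scaled uniformly until it fills the view and is then cropped around the center.
fun mapToView(box: BoxF, imageW: Int, imageH: Int, viewW: Int, viewH: Int): BoxF {
    // The larger ratio guarantees the scaled image covers the whole view.
    val scale = maxOf(viewW.toFloat() / imageW, viewH.toFloat() / imageH)
    // Offsets are zero or negative: the overflowing part of the image is cropped.
    val dx = (viewW - imageW * scale) / 2f
    val dy = (viewH - imageH * scale) / 2f
    return BoxF(
        box.left * scale + dx,
        box.top * scale + dy,
        box.right * scale + dx,
        box.bottom * scale + dy
    )
}
```

If your preview uses a different scale mode (for example, fitting the whole image inside the view), the scale would be the smaller of the two ratios instead.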

6. Optimizing Performance

Throttle Analyses

Frame skipping: Analyze every nth frame (e.g., every 5th) when scanning a continuous camera feed.
Use STRATEGY_KEEP_ONLY_LATEST: Discards outdated frames.
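Frame skipping can be as small as a counter. The following is a minimal sketch, not part of ML Kit or CameraX; FrameThrottle and shouldAnalyze are hypothetical names, and it assumes the analyzer is called from a single thread.

```kotlin
// Hypothetical helper (not part of ML Kit or CameraX): lets every nth frame through.
class FrameThrottle(private val every: Int) {
    init {
        require(every >= 1) { "every must be at least 1" }
    }

    private var counter = 0

    // Returns true for the first frame and then for every nth frame after it.
    fun shouldAnalyze(): Boolean {
        val analyze = counter == 0
        counter = (counter + 1) % every
        return analyze
    }
}
```

Inside the analyzer, a frame for which shouldAnalyze() returns false should still be closed right away so the camera keeps delivering new frames.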

Crop to ROI (Region of Interest)

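If you know where text appears (e.g., business card area), crop the source bitmap to a tighter rectangle before creating the InputImage.

// Rect takes (left, top, right, bottom); it must lie inside the bitmap or createBitmap throws.
val roi = Rect(50, 200, 1000, 600)
val croppedBitmap = Bitmap.createBitmap(bitmap, roi.left, roi.top, roi.width(), roi.height())
// Returned bounding boxes are relative to the crop; add roi.left and roi.top when drawing overlays.
val inputImage = InputImage.fromBitmap(croppedBitmap, rotation)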
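A fixed region can fall outside a smaller image, in which case the crop call fails. One way to guard against that is to clamp the region first. This is a sketch under stated assumptions: Roi is a hypothetical stand-in for android.graphics.Rect so the example runs without Android, and clampRoi is an illustrative helper, not a platform API.

```kotlin
// Hypothetical stand-in for android.graphics.Rect so the sketch runs without Android.
data class Roi(val left: Int, val top: Int, val right: Int, val bottom: Int) {
    val width: Int get() = right - left
    val height: Int get() = bottom - top
}

// Clamps the region to the image bounds; returns null if none of it lies inside the image.
fun clampRoi(roi: Roi, imageWidth: Int, imageHeight: Int): Roi? {
    val left = roi.left.coerceIn(0, imageWidth)
    val top = roi.top.coerceIn(0, imageHeight)
    val right = roi.right.coerceIn(0, imageWidth)
    val bottom = roi.bottom.coerceIn(0, imageHeight)
    return if (right > left && bottom > top) Roi(left, top, right, bottom) else null
}
```

When the result is null, falling back to the full frame is usually safer than skipping recognition altogether.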

Reuse Recognizer Instances

Instantiate TextRecognizer once and reuse to avoid repeated model loading.

7. Handling Multilingual Text

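Script Support on Android

ML Kit’s on-device API recognizes Latin script by default. Text Recognition v2 also ships separate on-device recognizers for Chinese, Devanagari, Japanese, and Korean, each added as its own dependency. For other scripts or handwriting, consider the cloud API or fallback libraries.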

Post-Processing and Language Detection

Language detection: ML Kit’s Language ID module can identify the language of a recognized text block.
Rule-Based Formatting: For dates, phone numbers, or postal codes, use regex to parse recognized text into structured fields.
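Rule-based parsing might look like the sketch below, which pulls an email address and a phone number out of recognized text. The patterns are deliberately simple and are assumptions for illustration; ContactFields and parseContactFields are hypothetical names, and real-world phone and email formats vary more than these expressions allow.

```kotlin
// Hypothetical container for fields pulled out of recognized text.
data class ContactFields(val email: String?, val phone: String?)

// Simple illustrative patterns; they will miss or over-match some real-world formats.
val emailRegex = Regex("""[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}""")
// Optional leading +, then digits mixed with spaces, dots, dashes, or parentheses (single line only).
val phoneRegex = Regex("""\+?\d[\d ().-]{6,}\d""")

// Returns the first email and first phone number found, or null for a field that is absent.
fun parseContactFields(recognizedText: String): ContactFields {
    val email = emailRegex.find(recognizedText)?.value
    val phone = phoneRegex.find(recognizedText)?.value
    return ContactFields(email, phone)
}
```

Because OCR can confuse characters such as 0 and O, it is worth validating parsed values before saving them.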

8. Error Handling and Edge Cases

Low Confidence Filtering

Each Text.Element has a confidence score (0.0–1.0). Discard elements below a threshold (e.g., 0.6) to reduce false positives.
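The filtering step itself is a one-liner once the scores are available. The sketch below uses a hypothetical RecognizedElement class in place of the ML Kit type so it can run anywhere; the 0.6 default mirrors the example threshold above and should be tuned against your own images.

```kotlin
// Hypothetical stand-in for a recognized element; field names are illustrative only.
data class RecognizedElement(val text: String, val confidence: Float)

// Keeps only the elements whose confidence meets or exceeds the threshold.
fun filterByConfidence(
    elements: List<RecognizedElement>,
    threshold: Float = 0.6f
): List<RecognizedElement> = elements.filter { it.confidence >= threshold }
```

A threshold that is too high can silently drop correct text, so logging what gets discarded during testing helps in picking a value.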

Partial Recognition

If only part of the desired text is recognized:

Prompt the user to reposition the document.
Aggregate multiple frames: Keep a rolling buffer of recognized text across frames, merging duplicates.
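One possible shape for the rolling buffer is a small voting scheme: keep the lines from the last few frames and trust only those that appear more than once. FrameTextAggregator, windowSize, and minHits are hypothetical names and defaults chosen for this sketch, and lines are compared by exact match after trimming, which is a simplifying assumption.

```kotlin
// Hypothetical helper: remembers the lines seen in the last `windowSize` frames
// and reports the ones that appeared in at least `minHits` of them.
class FrameTextAggregator(private val windowSize: Int = 5, private val minHits: Int = 2) {
    private val frames = ArrayDeque<Set<String>>()

    fun addFrame(lines: List<String>) {
        // Trim and de-duplicate within a frame so each frame votes once per line.
        val normalized = lines.map { it.trim() }.filter { it.isNotEmpty() }.toSet()
        frames.addLast(normalized)
        // Drop the oldest frame once the window is full.
        if (frames.size > windowSize) frames.removeFirst()
    }

    // Lines seen in at least `minHits` buffered frames, in first-seen order.
    fun stableLines(): List<String> =
        frames.flatten()
            .groupingBy { it }
            .eachCount()
            .filterValues { it >= minHits }
            .keys
            .toList()
}
```

Exact matching will treat near-identical misreads as different lines; a fuzzy comparison could merge them, at the cost of more complexity.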

9. UX Considerations

Visual Guides and Overlays

Outline the target area: Show a translucent rectangle on the camera preview where text is expected.
Live feedback: Display a “Scanning…” indicator and show recognized text in real time.

User Controls

Flash toggle: For low-light conditions.
Capture button: Allow user-initiated capture instead of continuous scanning.
Retry option: Let users retake images if results aren’t satisfactory.

10. Advanced Use Cases and Integrations

Automatic Form Filling

Extract and map fields: Match recognized text against form labels (e.g., “Name,” “Email”) using simple keyword matching.
UI population: Auto-populate EditText or UITextField controls with recognized values.
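Simple keyword matching could be sketched as follows, assuming the recognized lines follow a "Label: value" layout. mapFieldsByKeyword is a hypothetical helper, and documents without explicit labels (most business cards, for instance) would need pattern-based parsing instead.

```kotlin
// Maps "Label: value" lines to the form fields named in `labels` (case-insensitive).
fun mapFieldsByKeyword(lines: List<String>, labels: List<String>): Map<String, String> {
    val result = mutableMapOf<String, String>()
    for (line in lines) {
        val separator = line.indexOf(':')
        if (separator < 0) continue // Not a labelled line.
        val key = line.substring(0, separator).trim()
        val value = line.substring(separator + 1).trim()
        val label = labels.firstOrNull { it.equals(key, ignoreCase = true) }
        // Keep the first non-empty value seen for each label.
        if (label != null && value.isNotEmpty() && label !in result) {
            result[label] = value
        }
    }
    return result
}
```

The returned map can then drive UI population, for example by setting each matching field's text from the value stored under its label.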

Real-Time Translation

Recognize text on-device.
Send recognized string to a translation API (e.g., Google Translate).
Overlay translated text back on the camera preview.

Document Archiving

Batch scanning: Recognize and save multiple pages in a single session.
PDF generation: Combine the scanned page images and their recognized text into a searchable PDF using libraries like PDFBox (via its Android port) or PDFKit (iOS).

Conclusion

ML Kit’s on-device text recognition empowers you to build fast, private, and reliable OCR features with minimal setup. By following best practices—throttling analysis, cropping to regions of interest, reusing recognizer instances, and providing clear UX guides—you can deliver a polished scanning experience. From business-card readers to real-time translators, the possibilities are endless. Start with the sample code above, iterate on performance optimizations, and don’t forget to handle edge cases like low-confidence text or partial captures. With on-device OCR at your fingertips, you’ll delight users with apps that truly understand the world around them—right from their pocket.


Sheharyar

Sydney-based software solutions professional who has been crafting exceptional systems and applications to solve a diverse range of problems for the past 10 years.