AI Text-Based Photo Search for Mobile App

AI-Powered Text-Based Photo Search for Mobile Apps

"Show me photos with a dog on the beach": the user describes a photo in text, and the app finds the relevant photos. This is possible with CLIP (Contrastive Language-Image Pre-training, from OpenAI), a model trained to align images and text descriptions in a shared vector space. The cosine similarity between a text vector and an image vector serves as the relevance score.
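For reference, the relevance score can be written out as follows. The symbols are chosen here for illustration: t is the text embedding, v is the image embedding, and d is the embedding dimension.

```latex
\[
\operatorname{sim}(t, v)
  = \frac{t \cdot v}{\lVert t \rVert \, \lVert v \rVert}
  = \frac{\sum_{i=1}^{d} t_i v_i}
         {\sqrt{\sum_{i=1}^{d} t_i^{2}} \, \sqrt{\sum_{i=1}^{d} v_i^{2}}}
\]
```

If both vectors are L2-normalized to unit length, the denominator equals 1 and the score reduces to a plain dot product.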

Architecture: Embeddings + Vector Search

The pipeline has two independent stages:

Indexing (happens once for the entire gallery, then incrementally):

  • For each photo → CLIP image embedding (512-dimensional vector)
  • Save to a local vector database

Search (happens on each user query):

  • User query → CLIP text embedding (a vector in the same 512-dimensional space)
  • ANN search for the nearest vectors in the database
  • Return photos in order of descending cosine similarity
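The two stages can be sketched with a toy in-memory index. This is a minimal illustration, not the production design: `InMemoryPhotoIndex` is a hypothetical type, the brute-force scan stands in for a real ANN index, and hand-made 3-dimensional unit vectors stand in for 512-dimensional CLIP embeddings.

```swift
import Foundation

/// Toy in-memory index (hypothetical); a real app would persist vectors
/// and may replace the linear scan with an ANN index.
struct InMemoryPhotoIndex {
    private(set) var entries: [(assetId: String, embedding: [Float])] = []

    /// Stage 1 (indexing): store an already L2-normalized image embedding.
    mutating func insert(assetId: String, embedding: [Float]) {
        entries.append((assetId: assetId, embedding: embedding))
    }

    /// Stage 2 (search): rank stored photos by dot product with the query embedding.
    /// For unit-length vectors the dot product equals cosine similarity.
    func search(queryEmbedding: [Float], limit: Int) -> [(assetId: String, score: Float)] {
        let scored = entries.map { entry -> (assetId: String, score: Float) in
            let score = zip(entry.embedding, queryEmbedding)
                .reduce(Float(0)) { sum, pair in sum + pair.0 * pair.1 }
            return (assetId: entry.assetId, score: score)
        }
        return Array(scored.sorted { $0.score > $1.score }.prefix(limit))
    }
}

// Usage: made-up unit vectors in place of real embeddings.
var index = InMemoryPhotoIndex()
index.insert(assetId: "dog-beach", embedding: [1, 0, 0])
index.insert(assetId: "city-night", embedding: [0, 1, 0])
index.insert(assetId: "dog-park", embedding: [0.6, 0.8, 0])

let hits = index.search(queryEmbedding: [1, 0, 0], limit: 2)
print(hits.map { $0.assetId })  // ["dog-beach", "dog-park"]
```

Keeping indexing and search behind two separate methods mirrors the independence of the stages: the index can be filled in the background while search only ever reads it.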

CLIP On-Device via CoreML

Apple does not include CLIP in the standard Vision framework, but Apple ML Research released ml-mobileclip, a distilled version for mobile devices. MobileCLIP-S0: 18 MB, 3–5 ms inference per image on an iPhone 14.

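import CoreML
import CoreGraphics

// MobileCLIPImageEncoder / MobileCLIPTextEncoder are assumed to be the classes
// generated from the converted Core ML models. resize, tokenize, l2Normalize,
// MLMultiArray(from:) and toFloatArray() are app-defined helpers (not shown).
class MobileCLIPEmbedder {
    private let imageEncoder: MobileCLIPImageEncoder
    private let textEncoder: MobileCLIPTextEncoder

    init(imageEncoder: MobileCLIPImageEncoder, textEncoder: MobileCLIPTextEncoder) {
        self.imageEncoder = imageEncoder
        self.textEncoder = textEncoder
    }

    /// Returns a unit-length embedding for one image.
    func embedImage(_ cgImage: CGImage) throws -> [Float] {
        let resized = resize(cgImage, to: CGSize(width: 256, height: 256))
        let input = MobileCLIPImageInput(image: MLMultiArray(from: resized))
        let output = try imageEncoder.prediction(input: input)
        return l2Normalize(output.embedding.toFloatArray())
    }

    /// Returns a unit-length embedding for a text query.
    func embedText(_ query: String) throws -> [Float] {
        let tokens = tokenize(query)  // BPE tokenizer
        let input = MobileCLIPTextInput(tokens: MLMultiArray(from: tokens))
        let output = try textEncoder.prediction(input: input)
        return l2Normalize(output.embedding.toFloatArray())
    }
}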
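The embedder above calls `l2Normalize` without showing it. A minimal sketch of such a helper follows; the free-function form and the zero-vector handling are assumptions, not taken from the original code.

```swift
import Foundation

/// Scales a vector to unit length (L2 norm of 1).
/// Assumption: a zero vector is returned unchanged instead of dividing by zero.
func l2Normalize(_ vector: [Float]) -> [Float] {
    let norm = vector.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    guard norm > 0 else { return vector }
    return vector.map { $0 / norm }
}

// Usage: the vector (3, 4) has length 5, so it scales to roughly (0.6, 0.8).
let unit = l2Normalize([3, 4])
print(unit)
```

Normalizing both image and text embeddings once, at embedding time, is what lets the search stage use a plain dot product as cosine similarity.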

The CLIP tokenizer uses BPE (Byte Pair Encoding). A Swift implementation is available in the apple/ml-mobileclip repository.

On Android, MobileCLIP can run through ONNX Runtime (OrtEnvironment + OrtSession), processing images in batches of 8. This is less convenient than Core ML, but it works.

Vector Database On Device

Searching among 50,000 vectors calls for an ANN (approximate nearest neighbor) index. Options:

SQLite with sqlite-vss extension — adds virtual tables for vector search. Compact, works embedded:

CREATE VIRTUAL TABLE photo_embeddings USING vss0(embedding(512));
INSERT INTO photo_embeddings(rowid, embedding) VALUES (42, json('[0.1, -0.3, ...]'));
SELECT rowid, distance FROM photo_embeddings WHERE vss_search(embedding, json('[0.2, -0.1, ...]')) LIMIT 20;

FAISS (C++) via JNI/Swift bridging: faster at scale, harder to integrate.

Simple flat L2/cosine scan via Accelerate: for galleries of up to 10k photos this is sufficient without a specialized index:

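import Accelerate

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Vectors must have the same dimension")
    var dotProduct: Float = 0
    vDSP_dotpr(a, 1, b, 1, &dotProduct, vDSP_Length(a.count))
    return dotProduct  // Equals cosine similarity only if both inputs are L2-normalized
}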

Iterating through 10,000 512-dimensional vectors on an iPhone 14 via vDSP_dotpr takes ~15 ms. This remains acceptable for galleries of up to 20k photos.
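A rough back-of-the-envelope estimate shows how the flat scan scales. The function names are hypothetical. The sketch assumes Float32 storage and scan time that grows linearly from the ~15 ms per 10,000 vectors quoted above, which ignores cache and memory-bandwidth effects.

```swift
import Foundation

/// Bytes needed to hold `photoCount` Float32 embeddings of `dimensions` values each.
func indexSizeInBytes(photoCount: Int, dimensions: Int = 512) -> Int {
    return photoCount * dimensions * MemoryLayout<Float>.size
}

/// Linear extrapolation of flat-scan time (assumption: ~15 ms per 10,000 vectors).
func estimatedScanMilliseconds(photoCount: Int, msPer10k: Double = 15) -> Double {
    return Double(photoCount) / 10_000 * msPer10k
}

for count in [10_000, 20_000, 50_000] {
    let megabytes = Double(indexSizeInBytes(photoCount: count)) / 1_000_000
    let scanMs = estimatedScanMilliseconds(photoCount: count)
    print("\(count) photos: \(megabytes) MB, ~\(scanMs) ms per query")
}
```

Under these assumptions, 10k photos need about 20 MB of vectors, and a 50k gallery would need roughly 100 MB and about 75 ms per query. That is the range where an ANN index starts to pay off.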

Background Indexing

The first indexing pass over a 10k-photo gallery at 4 ms/photo takes about 40 seconds of embedding time. Run it via BGProcessingTask:

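import Photos

// Save progress: resume from the checkpoint on the next launch
class GalleryIndexer {
    private var lastIndexedDate: Date {
        get { UserDefaults.standard.object(forKey: "lastIndexedDate") as? Date ?? .distantPast }
        set { UserDefaults.standard.set(newValue, forKey: "lastIndexedDate") }
    }

    func indexNewPhotos() async {
        let fetchOptions = PHFetchOptions()
        fetchOptions.predicate = NSPredicate(format: "creationDate > %@", lastIndexedDate as NSDate)
        // Oldest first, so the checkpoint can advance photo by photo
        fetchOptions.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: true)]
        let newPhotos = PHAsset.fetchAssets(with: .image, options: fetchOptions)

        newPhotos.enumerateObjects { [weak self] asset, _, stop in
            // Stop early if the indexer is gone or the surrounding task was cancelled
            guard let self, !Task.isCancelled else { stop.pointee = true; return }
            // computeEmbedding(for:) and vectorDB are app-defined (not shown)
            if let embedding = self.computeEmbedding(for: asset) {
                self.vectorDB.insert(assetId: asset.localIdentifier, embedding: embedding)
            }
            // Checkpoint after each photo so an interrupted run keeps its progress
            if let created = asset.creationDate { self.lastIndexedDate = created }
        }
    }
}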
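Scheduling the indexer through BGProcessingTask could look roughly like the sketch below. The task identifier and the `PhotoIndexing` protocol are hypothetical; `GalleryIndexer` above would conform to the protocol. The identifier must also be listed under BGTaskSchedulerPermittedIdentifiers in Info.plist. The sketch assumes Swift 5 language mode and only compiles its body on platforms that ship the BackgroundTasks framework.

```swift
import Foundation

// Hypothetical identifier, for illustration only.
let indexingTaskIdentifier = "com.example.photosearch.indexing"

// Hypothetical abstraction over the indexer shown above.
protocol PhotoIndexing {
    func indexNewPhotos() async
}

#if canImport(BackgroundTasks)
import BackgroundTasks

/// Call once during app launch, before launch finishes.
func registerIndexingTask(indexer: PhotoIndexing) {
    _ = BGTaskScheduler.shared.register(forTaskWithIdentifier: indexingTaskIdentifier,
                                        using: nil) { task in
        let work = Task {
            await indexer.indexNewPhotos()
            // Report failure if the system cut the run short.
            task.setTaskCompleted(success: !Task.isCancelled)
        }
        // Called by the system when the time budget runs out.
        task.expirationHandler = { work.cancel() }
    }
}

/// Ask the system to run indexing later, preferably while the device is charging.
func scheduleIndexing() {
    let request = BGProcessingTaskRequest(identifier: indexingTaskIdentifier)
    request.requiresExternalPower = true
    request.requiresNetworkConnectivity = false
    do {
        try BGTaskScheduler.shared.submit(request)
    } catch {
        print("Could not schedule indexing: \(error)")
    }
}
#endif
```

Because the system decides when the task runs and may expire it at any point, per-photo checkpointing matters: each background run picks up where the previous one stopped.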

Search: Processing Query

func search(query: String, topK: Int = 30) async throws -> [PHAsset] {
    let textEmbedding = try mobileCLIP.embedText(query)
    let results = vectorDB.search(vector: textEmbedding, limit: topK)

    let fetchOptions = PHFetchOptions()
    fetchOptions.predicate = NSPredicate(
        format: "localIdentifier IN %@",
        results.map { $0.assetId }
    )
    let assets = PHAsset.fetchAssets(with: fetchOptions)

    // Sort by relevance (order from vectorDB)
    let idToScore = Dictionary(uniqueKeysWithValues: results.map { ($0.assetId, $0.score) })
    return assets.objects(at: IndexSet(0..<assets.count))
        .sorted { idToScore[$0.localIdentifier, default: 0] > idToScore[$1.localIdentifier, default: 0] }
}

Search latency: text embedding (~5 ms) + vector search (~15 ms for a flat scan over 10k vectors) = ~20 ms. Results feel instant to the user.

Multilingual Search

CLIP was trained mainly on English text. For a Russian query such as "собака на пляже" (dog on the beach), result quality is worse than for the English equivalent. The solution is to translate the query before computing the embedding, either with a simple dictionary of frequent words or with the Google Translate API. In practice, translating the 100–200 most frequent queries locally is often enough, with no API call needed.
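The dictionary approach can be sketched as a word-by-word lookup. The dictionary contents and the `translateQuery` function are illustrative assumptions. A real table would cover the most frequent queries and would need to handle inflected word forms, which this sketch only does by listing them explicitly.

```swift
import Foundation

// Tiny illustrative dictionary; inflected forms are listed as separate keys.
let russianToEnglish: [String: String] = [
    "собака": "dog",
    "пляж": "beach",
    "пляже": "beach",
    "на": "on the",
]

/// Translates a query word by word; unknown words pass through unchanged.
func translateQuery(_ query: String, using dictionary: [String: String]) -> String {
    return query.lowercased()
        .split(separator: " ")
        .map { dictionary[String($0)] ?? String($0) }
        .joined(separator: " ")
}

let translated = translateQuery("Собака на пляже", using: russianToEnglish)
print(translated)  // dog on the beach
```

The translated string is then passed to the text encoder in place of the original query. Words missing from the dictionary are left as typed, so a fallback to a translation API remains possible for rare queries.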

Timelines

Basic CLIP search with a flat index for galleries of up to 10k photos: 1–1.5 weeks. A scalable implementation with an ANN index, incremental updates, multilingual support, and visual search by reference photo: 3–4 weeks. Cost is calculated individually.