Role overview
We are seeking a highly skilled LLM Engineer to help develop a multi-modal large language model (LLM) pipeline for digitizing geotechnical bore log data. The role is central to transforming unstructured PDF documents into structured, machine-readable JSON outputs that support downstream analytics, GIS integration, and AI-powered search.
You will work closely with a Project Manager and technical stakeholders at our customer to build, fine-tune, and evaluate a custom LLM solution capable of interpreting complex geotechnical documents across multiple vendors.
What you'll work on
Phase 1 – Pilot Development
- Fine-tune a multi-modal LLM (e.g., Pixtral-12B, PaliGemma, Gemma 3) using annotated bore log PDFs and JSON samples.
- Build preprocessing pipelines for page segmentation, figure isolation, and normalization of units and soil classifications.
- Develop and implement an evaluation framework that covers precision/recall/F1, domain-specific metrics, and JSON schema conformance.
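As an illustration of the kind of evaluation involved, here is a minimal sketch of field-level precision/recall/F1 between a predicted and a reference JSON record. The function names (`flatten`, `field_prf`) and the bore log field names (`borehole_id`, `layers`, `top_m`, `bottom_m`, `soil`) are hypothetical and only for illustration; the project's actual schema and metrics may differ.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> set[tuple[str, str]]:
    """Flatten nested JSON into a set of (path, value-as-string) pairs."""
    pairs: set[tuple[str, str]] = set()
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            pairs |= flatten(value, path)
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    pairs |= flatten(item, f"{path}[{i}]")
                else:
                    pairs.add((f"{path}[{i}]", str(item)))
        else:
            pairs.add((path, str(value)))
    return pairs


def field_prf(predicted: dict[str, Any], expected: dict[str, Any]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over flattened fields."""
    pred, gold = flatten(predicted), flatten(expected)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical records: the prediction gets one soil class wrong.
expected = {
    "borehole_id": "BH-01",
    "layers": [{"top_m": 0.0, "bottom_m": 1.5, "soil": "CL"}],
}
predicted = {
    "borehole_id": "BH-01",
    "layers": [{"top_m": 0.0, "bottom_m": 1.5, "soil": "CH"}],
}
scores = field_prf(predicted, expected)
print(scores)
```

Exact string matching is the simplest possible choice here; a real framework would likely add numeric tolerances for depths and a separate schema-conformance check.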
Cross-Vendor Generalization
- Test model generalization on bore logs from three additional vendors.
- Identify and categorize failure cases.
- Compare performance across vendors and recommend strategies for scaling.
Pipeline Packaging & Handoff
- Package preprocessing scripts, model artifacts, and evaluation dashboards into a reproducible workflow.
- Deliver structured JSON outputs and final benchmark reports.
- Provide all source code and documentation for handoff.
What we're looking for
- Experience with geotechnical or engineering datasets
- Familiarity with MongoDB, vector search, and embedding-based retrieval
- Exposure to MLOps practices and CI/CD for ML pipelines
- Prior work in AI document ingestion or enterprise-scale data transformation