Role overview
We are seeking a highly skilled LLM Engineer to help develop a multi-modal Large Language Model (LLM) pipeline for digitizing geotechnical bore log data. This role is critical to transforming unstructured PDF documents into structured, machine-readable JSON outputs that support downstream analytics, GIS integration, and AI-powered search.
You will work closely with a Project Manager and our customer's technical stakeholders to build, fine-tune, and evaluate a custom LLM solution capable of interpreting complex geotechnical documents from multiple vendors.
Responsibilities
- Fine-tune a multi-modal LLM (e.g., Pixtral-12B, PaliGemma, Gemma 3) using annotated bore log PDFs and JSON samples.
- Build preprocessing pipelines for page segmentation, figure isolation, and normalization of units and soil classifications.
- Develop and implement an evaluation framework covering Precision/Recall/F1, domain-specific metrics, and JSON schema conformance.
- Test model generalization on bore logs from 3 additional vendors.
- Identify and categorize failure cases.
- Compare performance across vendors and recommend strategies for scaling.
- Package preprocessing scripts, model artifacts, and evaluation dashboards into a reproducible workflow.
- Deliver structured JSON outputs and final benchmark reports.
- Provide all source code and documentation for handoff.
Basic qualifications
- Proven experience fine-tuning and deploying multi-modal LLMs (e.g., Pixtral, LLaMA, Gemma)
- Experience with Ollama/llama.cpp, MongoDB or other non-relational databases, and AI coding tools (Cursor, Windsurf, Copilot)
- Experience using open-source (OSS) models
- Strong proficiency in Python and ML frameworks (e.g., PyTorch, TensorFlow)
- Experience with OCR, image preprocessing (OpenCV), and document parsing
- Familiarity with geospatial data and JSON schema design
- Ability to work with GPU environments (e.g., A100s) and cloud-based training setups
- Strong understanding of evaluation metrics and model benchmarking
- Excellent communication and documentation skills
Preferred qualifications
- Experience with geotechnical or engineering datasets
- Familiarity with MongoDB, vector search, and embedding-based retrieval
- Exposure to MLOps practices and CI/CD for ML pipelines
- Prior work in AI document ingestion or enterprise-scale data transformation