**Role Overview**
We are seeking a Senior Software Development Engineer in Test to support the Embedded Intelligence team at one of our premier clients. The role combines traditional software testing with advanced AI evaluation practices to ensure that Large Language Models (LLMs) are reliable, accurate, and aligned with how Operators and Team Members naturally interact with technology. The ideal candidate will be hands-on in designing automated test suites, evaluating model behavior, and collaborating across product, data science, and engineering to define what “good” looks like in model performance.
**Key Responsibilities**
Design, build, and execute automated test suites to evaluate LLM responses across diverse real-world scenarios (a minimal sketch follows this list).
Apply AI evaluation frameworks (e.g., DeepEval, LangSmith, TruLens) to measure accuracy, relevance, coherence, and consistency.
Develop and maintain prompt libraries and representative datasets reflecting restaurant and Operator-level interactions.
Partner with product managers, data scientists, and software engineers to analyze evaluation results, identify gaps, and enhance LLM reliability.
Define and track key performance metrics for model quality, safety, and production readiness.
Continuously research and integrate new tools and methodologies for AI model validation and test automation.
Contribute to building a repeatable LLM testing framework that enables scalable evaluation across new model versions and domains.
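To make the responsibilities above concrete, here is a minimal sketch of what an automated LLM evaluation suite can look like in PyTest. The `SCENARIOS` data and the `generate_response` and `relevance_score` helpers are hypothetical placeholders invented for illustration; in practice the response would come from the model under test and the score from an evaluation framework such as DeepEval or LangSmith rather than the crude lexical overlap used here to keep the example self-contained.

```python
"""Minimal sketch of an automated LLM evaluation suite (illustrative only)."""
import re

import pytest

# Hypothetical scenarios; in practice these would be drawn from a curated
# prompt library of Operator and Team Member interactions.
SCENARIOS = [
    ("How do I close out the register at end of day?", ["register", "close"]),
    ("Which menu items contain peanuts?", ["peanut"]),
]


def generate_response(prompt: str) -> str:
    # Placeholder for the model under test; replace with a real client call.
    canned = {
        "How do I close out the register at end of day?":
            "To close the register, run the end of day close report and count the drawer.",
        "Which menu items contain peanuts?":
            "Items fried in peanut oil may contain peanut allergens; check the allergen guide.",
    }
    return canned[prompt]


def relevance_score(prompt: str, response: str) -> float:
    # Crude lexical-overlap stand-in for an LLM-as-judge or framework metric,
    # used only so the sketch runs end to end without external dependencies.
    p = set(re.findall(r"[a-z]+", prompt.lower()))
    r = set(re.findall(r"[a-z]+", response.lower()))
    return len(p & r) / max(len(p), 1)


@pytest.mark.parametrize("prompt,required_terms", SCENARIOS)
def test_llm_response_quality(prompt, required_terms):
    response = generate_response(prompt)
    # Coverage check: the answer should mention key domain terms.
    assert all(term in response.lower() for term in required_terms)
    # Relevance check against an agreed threshold.
    assert relevance_score(prompt, response) >= 0.2
```

A production suite would swap the lexical overlap for framework-backed metrics and pull scenarios from the maintained prompt library described above.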
**Qualifications**
Bachelor’s degree in Computer Science, Data Science, or a related field (or equivalent experience).
4+ years of experience in software testing, test automation, or quality engineering.
Proven ability to develop automated tests using modern frameworks (e.g., PyTest, Playwright, or similar).
Direct experience with LLM evaluation or prompt testing, including tools like DeepEval, LangChain, or LangSmith.
Proficiency in Python and familiarity with API testing and data validation techniques.
Strong analytical, debugging, and problem-solving skills.
Excellent communication and collaboration abilities; capable of translating technical insights into actionable outcomes for cross-functional teams.
A curious, experimental mindset; comfortable working in emerging AI domains with evolving standards.
**Preferred Skills**
Experience with AI/ML pipelines or MLOps tools (e.g., Databricks, Vertex AI, SageMaker).
Familiarity with evaluation metrics for model performance (BLEU, ROUGE, BERTScore, relevance scoring, etc.); a small worked example follows this list.
Exposure to cloud infrastructure (AWS, GCP, Azure) for model deployment and testing.
Prior experience in enterprise or production-scale LLM applications.
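For reference, ROUGE-1 recall, one of the metrics named above, is simply the fraction of reference unigrams recovered by the candidate text. A minimal sketch follows; the function name and example strings are illustrative, not taken from any particular library.

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    # ROUGE-1 recall: clipped unigram overlap divided by reference length.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / max(sum(ref.values()), 1)


# 6 of the 7 reference unigrams appear in the candidate, so recall is ~0.86.
print(rouge1_recall(
    "the waffle fries are cooked in canola oil",
    "waffle fries are cooked in peanut oil",
))
```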