ArXiv Contrastive Embeddings

Fine-tune Sentence Transformers models on arXiv titles and abstracts to produce embeddings that cluster papers by primary subject and power similarity search.

Project summary

We build pair datasets from arXiv metadata for contrastive learning, fine-tune Sentence Transformers models across multiple backbones, and evaluate semantic retrieval alongside embedding-based classification. The system builds a FAISS similarity-search index, supports ONNX Runtime for faster inference, and serves embeddings behind a FastAPI endpoint for downstream apps and retrieval services.

The pipeline starts by downloading and splitting the arXiv dataset, then builds contrastive pairs from primary subjects. We train the Sentence Transformers models with the selected loss, evaluate precision@k alongside classifier metrics, compare against TF-IDF baselines, export ONNX models, and build a FAISS index before serving embeddings through the API.
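Building positive pairs from primary subjects can be sketched as below. The record fields and the pair-sampling policy here are illustrative assumptions, not the project's actual schema:

```python
import itertools
import random
from collections import defaultdict

def build_positive_pairs(records, seed=0, max_pairs_per_subject=2):
    """Group papers by primary subject and emit (anchor, positive) text pairs.

    `records` is an iterable of dicts with 'title', 'abstract', and
    'primary_subject' keys -- a simplified stand-in for arXiv metadata.
    """
    by_subject = defaultdict(list)
    for r in records:
        by_subject[r["primary_subject"]].append(f"{r['title']}. {r['abstract']}")

    rng = random.Random(seed)
    pairs = []
    for texts in by_subject.values():
        combos = list(itertools.combinations(texts, 2))
        rng.shuffle(combos)
        pairs.extend(combos[:max_pairs_per_subject])  # cap pairs per subject
    return pairs

papers = [
    {"title": "A", "abstract": "graphs", "primary_subject": "cs.LG"},
    {"title": "B", "abstract": "kernels", "primary_subject": "cs.LG"},
    {"title": "C", "abstract": "quarks", "primary_subject": "hep-ph"},
]
pairs = build_positive_pairs(papers)
print(len(pairs))  # 1: only cs.LG has two papers to pair
```

Pairs like these feed a contrastive loss where same-subject texts are pulled together and, depending on the loss, in-batch texts from other subjects act as negatives.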

Stack

A quick snapshot of the tooling that anchors data, training, automation, and serving across the project.

Get started

This repository uses uv for package and project management.

uv sync --dev
uv run python src/mlops_project/data.py
uv run python src/mlops_project/train.py
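Once the data and training entry points above have run, the precision@k retrieval metric mentioned earlier can be checked with a small helper along these lines (the similarity matrix and labels are toy stand-ins for real query/corpus embeddings and subject labels):

```python
import numpy as np

def precision_at_k(sim, query_labels, corpus_labels, k=5):
    """Mean fraction of top-k retrieved papers sharing the query's subject.

    `sim` is an (n_queries, n_corpus) similarity matrix, e.g. cosine
    similarities between query and corpus embeddings.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of top-k hits
    hits = corpus_labels[topk] == query_labels[:, None]
    return hits.mean()

# Tiny worked example with hand-set similarities.
sim = np.array([
    [0.9, 0.8, 0.1],    # query 0 prefers corpus items 0 and 1
    [0.2, 0.7, 0.95],   # query 1 prefers items 2 and 1
])
query_labels = np.array([0, 1])
corpus_labels = np.array([0, 0, 1])
print(precision_at_k(sim, query_labels, corpus_labels, k=2))  # 0.75
```

The same labels-against-neighbors comparison underlies the embedding-based classification check: a k-NN vote over retrieved subjects gives a quick classifier baseline.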

Key docs

Start with Data and preprocessing, then follow Training and Evaluation. Deployment details live in API and deployment, and configuration lives in Configuration.