ArXiv Contrastive Embeddings

Fine-tune Sentence Transformers models on arXiv titles and abstracts to produce embeddings that cluster papers by primary subject and power similarity search.

Project summary

We build pair datasets from arXiv metadata for contrastive learning, fine-tune Sentence Transformers models across multiple backbones, and evaluate semantic retrieval alongside embedding-based classification. The system builds a FAISS similarity-search index, supports ONNX Runtime for faster inference, and serves embeddings behind a FastAPI endpoint for downstream apps and retrieval services.

The pipeline starts by downloading and splitting the arXiv dataset, then builds contrastive pairs from primary subjects. We train the Sentence Transformers models with the selected loss, evaluate precision@k alongside classifier metrics, compare against TF-IDF baselines, export ONNX models, and build a FAISS index before serving embeddings through the API.
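Building positive pairs from primary subjects can be sketched as below. The record fields and the pair-sampling policy here are illustrative assumptions, not the project's actual schema:

```python
import itertools
import random
from collections import defaultdict

def build_positive_pairs(records, seed=0, max_pairs_per_subject=2):
    """Group papers by primary subject and emit (anchor, positive) text pairs.

    `records` is an iterable of dicts with 'title', 'abstract', and
    'primary_subject' keys -- a simplified stand-in for arXiv metadata.
    """
    by_subject = defaultdict(list)
    for r in records:
        by_subject[r["primary_subject"]].append(f"{r['title']}. {r['abstract']}")

    rng = random.Random(seed)
    pairs = []
    for texts in by_subject.values():
        combos = list(itertools.combinations(texts, 2))
        rng.shuffle(combos)
        pairs.extend(combos[:max_pairs_per_subject])  # cap pairs per subject
    return pairs

papers = [
    {"title": "A", "abstract": "graphs", "primary_subject": "cs.LG"},
    {"title": "B", "abstract": "kernels", "primary_subject": "cs.LG"},
    {"title": "C", "abstract": "quarks", "primary_subject": "hep-ph"},
]
pairs = build_positive_pairs(papers)
print(len(pairs))  # 1: only cs.LG has two papers to pair
```

Pairs like these feed a contrastive loss where same-subject texts are pulled together and, depending on the loss, in-batch texts from other subjects act as negatives.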

Stack

A quick snapshot of the tooling that anchors data, training, automation, and serving across the project.

Get started

This repository uses uv for package and project management.

uv sync --dev
uv run python src/mlops_project/data.py
uv run python src/mlops_project/train.py
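Once the data and training entry points above have run, the precision@k retrieval metric mentioned earlier can be checked with a small helper along these lines (the similarity matrix and labels are toy stand-ins for real query/corpus embeddings and subject labels):

```python
import numpy as np

def precision_at_k(sim, query_labels, corpus_labels, k=5):
    """Mean fraction of top-k retrieved papers sharing the query's subject.

    `sim` is an (n_queries, n_corpus) similarity matrix, e.g. cosine
    similarities between query and corpus embeddings.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of top-k hits
    hits = corpus_labels[topk] == query_labels[:, None]
    return hits.mean()

# Tiny worked example with hand-set similarities.
sim = np.array([
    [0.9, 0.8, 0.1],    # query 0 prefers corpus items 0 and 1
    [0.2, 0.7, 0.95],   # query 1 prefers items 2 and 1
])
query_labels = np.array([0, 1])
corpus_labels = np.array([0, 0, 1])
print(precision_at_k(sim, query_labels, corpus_labels, k=2))  # 0.75
```

The same labels-against-neighbors comparison underlies the embedding-based classification check: a k-NN vote over retrieved subjects gives a quick classifier baseline.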

Key docs

Start with Data and preprocessing, then follow Training and Evaluation. Deployment details live in API and deployment, and configuration lives in Configuration.