Python API

This page contains the auto-generated API reference for the Python modules in src/mlops_project.

Modules

mlops_project.data

ArxivPapersDataset

arXiv papers dataset from Hugging Face.

__getitem__

__getitem__(index: int) -> dict

Return a given sample from the dataset.

__len__

__len__() -> int

Return the length of the dataset.
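The class exposes the standard map-style dataset protocol (`__len__` and `__getitem__`). A minimal, self-contained sketch of that interface, assuming the real class wraps Hugging Face records as dicts with fields such as `abstract` and `primary_subject` (those field names are an assumption for illustration):

```python
# Hypothetical stand-in for ArxivPapersDataset, backed by an in-memory
# list of dicts instead of the real Hugging Face dataset.
class ArxivPapersDatasetSketch:
    def __init__(self, records: list[dict]):
        self._records = records

    def __len__(self) -> int:
        # Return the length of the dataset.
        return len(self._records)

    def __getitem__(self, index: int) -> dict:
        # Return a given sample from the dataset.
        return self._records[index]


ds = ArxivPapersDatasetSketch(
    [{"abstract": "We study ...", "primary_subject": "cs.LG"}]
)
print(len(ds))                    # 1
print(ds[0]["primary_subject"])   # cs.LG
```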

create_contrastive_pairs

create_contrastive_pairs(
    dataset,
    num_pairs: int = 100000,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
)

Create positive and negative pairs for ContrastiveLoss.

Returns a dataset with columns: sentence1, sentence2, label.

- label=1.0 for positive pairs (same subject)
- label=0.0 for negative pairs (different subjects)

Parameters:

- balanced (bool, default True): If True, each subject has equal probability of being chosen. If False, subjects are weighted by their frequency in the dataset.
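The pairing logic can be sketched as follows. This is a minimal illustration, not the project's implementation: the field name `primary_subject`, the grouping-by-subject approach, and the even positive/negative split are assumptions based on the docstring.

```python
import random

def make_contrastive_pairs(records, num_pairs, text_field="abstract",
                           seed=42, balanced=True):
    """Sketch: alternate same-subject (label 1.0) and cross-subject
    (label 0.0) pairs drawn from records grouped by primary_subject."""
    rng = random.Random(seed)
    by_subject: dict[str, list[str]] = {}
    for r in records:
        by_subject.setdefault(r["primary_subject"], []).append(r[text_field])
    # Only subjects with at least two texts can yield a positive pair.
    subjects = [s for s, texts in by_subject.items() if len(texts) >= 2]
    weights = [len(by_subject[s]) for s in subjects]
    pairs = []
    for i in range(num_pairs):
        if i % 2 == 0:
            # Positive pair: uniform over subjects if balanced,
            # otherwise weighted by subject frequency.
            s = (rng.choice(subjects) if balanced
                 else rng.choices(subjects, weights=weights)[0])
            a, b = rng.sample(by_subject[s], 2)
            pairs.append({"sentence1": a, "sentence2": b, "label": 1.0})
        else:
            # Negative pair: one text from each of two different subjects.
            s1, s2 = rng.sample(sorted(by_subject), 2)
            pairs.append({
                "sentence1": rng.choice(by_subject[s1]),
                "sentence2": rng.choice(by_subject[s2]),
                "label": 0.0,
            })
    return pairs
```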

create_pairs

create_pairs(
    dataset,
    pair_fn: Callable,
    save_path: Path,
    num_pairs: int,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
) -> Dataset

Create and save pairs to disk.

create_positive_pairs

create_positive_pairs(
    dataset,
    num_pairs: int = 100000,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
)

Create positive pairs for MultipleNegativesRankingLoss.

Returns a dataset with columns: anchor, positive. Each pair contains two abstracts from papers with the same primary_subject; MNRL will use in-batch negatives automatically.

Parameters:

- balanced (bool, default True): If True, each subject has equal probability of being chosen. If False, subjects are weighted by their frequency in the dataset.
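For MNRL, only same-subject pairs are needed, since every other pair in a batch serves as a negative. A minimal sketch under the same assumptions as above (field names and grouping strategy are hypothetical):

```python
import random

def make_positive_pairs(records, num_pairs, text_field="abstract", seed=42):
    """Sketch: draw anchor/positive pairs of texts sharing a subject."""
    rng = random.Random(seed)
    by_subject: dict[str, list[str]] = {}
    for r in records:
        by_subject.setdefault(r["primary_subject"], []).append(r[text_field])
    subjects = [s for s, texts in by_subject.items() if len(texts) >= 2]
    pairs = []
    for _ in range(num_pairs):
        # Two distinct texts from one subject; in-batch negatives come
        # for free with MultipleNegativesRankingLoss.
        a, b = rng.sample(by_subject[rng.choice(subjects)], 2)
        pairs.append({"anchor": a, "positive": b})
    return pairs
```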

ensure_data_exists

ensure_data_exists(
    data_dir: Path,
    dataset_config: DictConfig | ListConfig | None = None,
) -> None

Run preprocessing if required data doesn't exist or config has changed.

load_pairs

load_pairs(load_path: Path) -> Dataset

Load pairs from disk.

preprocess

preprocess(
    loss: LossType = LossType.MultipleNegativesRankingLoss,
    output_folder: Path = Path("data"),
    test_size: float = 0.2,
    number_of_pairs: int = 1000000,
    number_of_eval_pairs: int = 10000,
    seed: int = 42,
    source: str = "nick007x/arxiv-papers",
    columns: list[str] | None = None,
    text_field: str = "abstract",
    balanced: bool = True,
) -> None

Download and preprocess the arXiv papers dataset.

preprocess_hydra

preprocess_hydra(config: DictConfig | ListConfig) -> None

Hydra entry point for preprocessing.

mlops_project.model

mlops_project.evaluate

create_ir_evaluator

create_ir_evaluator(
    dataset,
    sample_size: int = 5000,
    name: str = "arxiv-retrieval",
)

Create an Information Retrieval evaluator for precision@k metrics.
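An IR evaluator of this kind (e.g. sentence_transformers' InformationRetrievalEvaluator) typically takes three dicts: queries, a corpus, and a relevance mapping. A sketch of how those inputs could be assembled; using titles as queries and abstracts as corpus documents is an assumption for illustration, not necessarily what create_ir_evaluator does:

```python
def build_ir_inputs(records):
    """Sketch: build (queries, corpus, relevant_docs) dicts where each
    title-query should retrieve its own paper's abstract."""
    queries, corpus, relevant = {}, {}, {}
    for i, r in enumerate(records):
        qid, did = f"q{i}", f"d{i}"
        queries[qid] = r["title"]
        corpus[did] = r["abstract"]
        relevant[qid] = {did}  # set of relevant document ids per query
    return queries, corpus, relevant
```

These dicts match the shape expected by sentence_transformers' `InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=...)`, which then reports precision@k and related metrics.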

mlops_project.train

mlops_project.utils

build_output_dir_name

build_output_dir_name(
    model: str, loss: str, num_pairs: int, balanced: bool
) -> str

Build output directory name from config values.

format_num_pairs

format_num_pairs(n: int) -> str

Format number of pairs for output directory name.
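A plausible sketch of how these two helpers fit together; the exact formatting scheme ("1M", "10k") and the directory-name layout are assumptions, not the project's confirmed output:

```python
def format_num_pairs_sketch(n: int) -> str:
    # Assumed compact formatting: 1_000_000 -> "1M", 10_000 -> "10k".
    if n % 1_000_000 == 0:
        return f"{n // 1_000_000}M"
    if n % 1_000 == 0:
        return f"{n // 1_000}k"
    return str(n)

def build_output_dir_name_sketch(model: str, loss: str,
                                 num_pairs: int, balanced: bool) -> str:
    # Assumed layout: join config values with underscores.
    suffix = "balanced" if balanced else "weighted"
    return f"{model}_{loss}_{format_num_pairs_sketch(num_pairs)}_{suffix}"

build_output_dir_name_sketch("minilm", "mnrl", 1_000_000, True)
# -> "minilm_mnrl_1M_balanced"
```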

mlops_project.visualize

CLI tool for visualizing paper embeddings as a 2D scatter plot.

create_scatter_plot

create_scatter_plot(
    coords: ndarray,
    subjects: list[str],
    output_path: Path,
    figsize: int = 16,
    dpi: int = 300,
    point_size: float = 1.5,
    alpha: float = 0.8,
    min_cluster_size_for_label: int = 150,
) -> None

Create and save a scatter plot of embeddings colored by subject.

Parameters:

- coords (ndarray, required): 2D coordinates, shape (n_samples, 2).
- subjects (list[str], required): Subject labels for each point.
- output_path (Path, required): Path to save the output image.
- figsize (int, default 16): Figure size (square).
- dpi (int, default 300): Output image DPI.
- point_size (float, default 1.5): Size of scatter points.
- alpha (float, default 0.8): Transparency of points.
- min_cluster_size_for_label (int, default 150): Only label clusters with at least this many points.

get_clean_label

get_clean_label(subject: str) -> str

Extract a clean label from a subject string, e.g. 'Machine Learning (cs.LG)' -> 'Machine Learning'.
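The described behavior amounts to stripping a trailing parenthesized category code. A minimal sketch of one way to do it (the regex approach is an assumption, not the project's implementation):

```python
import re

def get_clean_label_sketch(subject: str) -> str:
    # Drop a trailing "(...)" group, e.g. the arXiv category code.
    return re.sub(r"\s*\([^)]*\)\s*$", "", subject)

get_clean_label_sketch("Machine Learning (cs.LG)")  # -> "Machine Learning"
```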

load_metadata_subjects

load_metadata_subjects(index_dir: Path) -> list[str]

Load primary_subject for each paper from metadata.

Parameters:

- index_dir (Path, required): Directory containing metadata.json or metadata.db.

Returns:

- list[str]: List of primary_subject strings, one per paper.

reduce_dimensions

reduce_dimensions(
    embeddings: ndarray,
    method: str = "umap",
    n_neighbors: int = 15,
    min_dist: float = 0.1,
    random_state: int = 42,
) -> np.ndarray

Reduce embeddings to 2D using UMAP or t-SNE.

Parameters:

- embeddings (ndarray, required): Array of shape (n_samples, embedding_dim).
- method (str, default 'umap'): "umap" or "tsne".
- n_neighbors (int, default 15): UMAP n_neighbors parameter.
- min_dist (float, default 0.1): UMAP min_dist parameter.
- random_state (int, default 42): Random seed for reproducibility.

Returns:

- ndarray: Array of shape (n_samples, 2).

stratified_sample_indices

stratified_sample_indices(
    subjects: list[str], sample_size: int, rng: Generator
) -> np.ndarray

Sample indices with stratification by subject.

Parameters:

- subjects (list[str], required): List of subject labels.
- sample_size (int, required): Total number of samples to draw.
- rng (Generator, required): NumPy random generator.

Returns:

- ndarray: Array of sampled indices.
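Stratified sampling here means drawing from each subject in proportion to its size, so small subjects still appear in the sample. A sketch under that assumption (the quota scheme below, including the minimum of one draw per subject, is illustrative rather than the project's exact policy):

```python
import numpy as np

def stratified_sample_indices_sketch(subjects, sample_size, rng):
    """Sketch: sample indices proportionally to subject frequency,
    without replacement within each subject."""
    subjects = np.asarray(subjects)
    labels, counts = np.unique(subjects, return_counts=True)
    # Proportional quota per subject, with at least one draw each.
    quotas = np.maximum(1, (counts / counts.sum() * sample_size).astype(int))
    picked = []
    for label, quota in zip(labels, quotas):
        idx = np.flatnonzero(subjects == label)
        picked.append(rng.choice(idx, size=min(quota, idx.size),
                                 replace=False))
    return np.concatenate(picked)
```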

visualize

visualize(
    index_dir: str = typer.Option(
        "data/faiss",
        "--index-dir",
        "-i",
        help="Path to FAISS index directory.",
    ),
    output: str = typer.Option(
        "embedding_landscape.png",
        "--output",
        "-o",
        help="Output image path.",
    ),
    sample_size: int = typer.Option(
        50000,
        "--sample-size",
        "-n",
        help="Number of points to visualize.",
    ),
    method: str = typer.Option(
        "umap",
        "--method",
        "-m",
        help="Dimensionality reduction method: 'umap' or 'tsne'.",
    ),
    n_neighbors: int = typer.Option(
        15,
        "--n-neighbors",
        help="UMAP n_neighbors parameter.",
    ),
    min_dist: float = typer.Option(
        0.1, "--min-dist", help="UMAP min_dist parameter."
    ),
    dpi: int = typer.Option(
        300, "--dpi", help="Output image DPI."
    ),
    figsize: int = typer.Option(
        16, "--figsize", help="Figure size (square)."
    ),
    seed: int = typer.Option(
        42,
        "--seed",
        help="Random seed for reproducibility.",
    ),
) -> None

Generate a 2D scatter plot visualization of paper embeddings.

API module

The FastAPI module loads the model at import time, which prevents automatic documentation generation. See API and deployment for endpoint and runtime details.