Python API

This page contains the auto-generated API reference for the Python modules in src/mlops_project.

Modules

mlops_project.data

ArxivPapersDataset

arXiv papers dataset from Hugging Face.

__getitem__

__getitem__(index: int) -> dict

Return a given sample from the dataset.

__len__

__len__() -> int

Return the length of the dataset.
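The class exposes the standard map-style dataset protocol (`__len__` and `__getitem__`). A minimal, self-contained sketch of that interface, assuming the real class wraps Hugging Face records as dicts with fields such as `abstract` and `primary_subject` (those field names are an assumption for illustration):

```python
# Hypothetical stand-in for ArxivPapersDataset, backed by an in-memory
# list of dicts instead of the real Hugging Face dataset.
class ArxivPapersDatasetSketch:
    def __init__(self, records: list[dict]):
        self._records = records

    def __len__(self) -> int:
        # Return the length of the dataset.
        return len(self._records)

    def __getitem__(self, index: int) -> dict:
        # Return a given sample from the dataset.
        return self._records[index]


ds = ArxivPapersDatasetSketch(
    [{"abstract": "We study ...", "primary_subject": "cs.LG"}]
)
print(len(ds))                    # 1
print(ds[0]["primary_subject"])   # cs.LG
```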

create_contrastive_pairs

create_contrastive_pairs(
    dataset,
    num_pairs: int = 100000,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
)

Create positive and negative pairs for ContrastiveLoss.

Returns a dataset with columns: sentence1, sentence2, label.

- label=1.0 for positive pairs (same subject)
- label=0.0 for negative pairs (different subjects)

Parameters:

- balanced (bool, default True): If True, each subject has equal probability of being chosen. If False, subjects are weighted by their frequency in the dataset.
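The pairing logic can be sketched as follows. This is a minimal illustration, not the project's implementation: the field name `primary_subject`, the grouping-by-subject approach, and the even positive/negative split are assumptions based on the docstring.

```python
import random

def make_contrastive_pairs(records, num_pairs, text_field="abstract",
                           seed=42, balanced=True):
    """Sketch: alternate same-subject (label 1.0) and cross-subject
    (label 0.0) pairs drawn from records grouped by primary_subject."""
    rng = random.Random(seed)
    by_subject: dict[str, list[str]] = {}
    for r in records:
        by_subject.setdefault(r["primary_subject"], []).append(r[text_field])
    # Only subjects with at least two texts can yield a positive pair.
    subjects = [s for s, texts in by_subject.items() if len(texts) >= 2]
    weights = [len(by_subject[s]) for s in subjects]
    pairs = []
    for i in range(num_pairs):
        if i % 2 == 0:
            # Positive pair: uniform over subjects if balanced,
            # otherwise weighted by subject frequency.
            s = (rng.choice(subjects) if balanced
                 else rng.choices(subjects, weights=weights)[0])
            a, b = rng.sample(by_subject[s], 2)
            pairs.append({"sentence1": a, "sentence2": b, "label": 1.0})
        else:
            # Negative pair: one text from each of two different subjects.
            s1, s2 = rng.sample(sorted(by_subject), 2)
            pairs.append({
                "sentence1": rng.choice(by_subject[s1]),
                "sentence2": rng.choice(by_subject[s2]),
                "label": 0.0,
            })
    return pairs
```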

create_pairs

create_pairs(
    dataset,
    pair_fn: Callable,
    save_path: Path,
    num_pairs: int,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
) -> Dataset

Create and save pairs to disk.

create_positive_pairs

create_positive_pairs(
    dataset,
    num_pairs: int = 100000,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
)

Create positive pairs for MultipleNegativesRankingLoss.

Returns a dataset with columns: anchor, positive. Each pair contains two abstracts from papers with the same primary_subject; MNRL will use in-batch negatives automatically.

Parameters:

- balanced (bool, default True): If True, each subject has equal probability of being chosen. If False, subjects are weighted by their frequency in the dataset.
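For MNRL, only same-subject pairs are needed, since every other pair in a batch serves as a negative. A minimal sketch under the same assumptions as above (field names and grouping strategy are hypothetical):

```python
import random

def make_positive_pairs(records, num_pairs, text_field="abstract", seed=42):
    """Sketch: draw anchor/positive pairs of texts sharing a subject."""
    rng = random.Random(seed)
    by_subject: dict[str, list[str]] = {}
    for r in records:
        by_subject.setdefault(r["primary_subject"], []).append(r[text_field])
    subjects = [s for s, texts in by_subject.items() if len(texts) >= 2]
    pairs = []
    for _ in range(num_pairs):
        # Two distinct texts from one subject; in-batch negatives come
        # for free with MultipleNegativesRankingLoss.
        a, b = rng.sample(by_subject[rng.choice(subjects)], 2)
        pairs.append({"anchor": a, "positive": b})
    return pairs
```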

ensure_data_exists

ensure_data_exists(
    data_dir: Path,
    dataset_config: DictConfig | ListConfig | None = None,
) -> None

Run preprocessing if required data doesn't exist or config has changed.

load_pairs

load_pairs(load_path: Path) -> Dataset

Load pairs from disk.

preprocess

preprocess(
    loss: LossType = LossType.MultipleNegativesRankingLoss,
    output_folder: Path = Path("data"),
    test_size: float = 0.2,
    number_of_pairs: int = 1000000,
    number_of_eval_pairs: int = 10000,
    seed: int = 42,
    source: str = "nick007x/arxiv-papers",
    columns: list[str] | None = None,
    text_field: str = "abstract",
    balanced: bool = True,
) -> None

Download and preprocess the arXiv papers dataset.

preprocess_hydra

preprocess_hydra(config: DictConfig | ListConfig) -> None

Hydra entry point for preprocessing.

mlops_project.model

mlops_project.evaluate

create_ir_evaluator

create_ir_evaluator(
    dataset,
    sample_size: int = 5000,
    name: str = "arxiv-retrieval",
)

Create an Information Retrieval evaluator for precision@k metrics.
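An IR evaluator of this kind (e.g. sentence_transformers' InformationRetrievalEvaluator) typically takes three dicts: queries, a corpus, and a relevance mapping. A sketch of how those inputs could be assembled; using titles as queries and abstracts as corpus documents is an assumption for illustration, not necessarily what create_ir_evaluator does:

```python
def build_ir_inputs(records):
    """Sketch: build (queries, corpus, relevant_docs) dicts where each
    title-query should retrieve its own paper's abstract."""
    queries, corpus, relevant = {}, {}, {}
    for i, r in enumerate(records):
        qid, did = f"q{i}", f"d{i}"
        queries[qid] = r["title"]
        corpus[did] = r["abstract"]
        relevant[qid] = {did}  # set of relevant document ids per query
    return queries, corpus, relevant
```

These dicts match the shape expected by sentence_transformers' `InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=...)`, which then reports precision@k and related metrics.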

mlops_project.train

mlops_project.utils

build_output_dir_name

build_output_dir_name(
    model: str, loss: str, num_pairs: int, balanced: bool
) -> str

Build output directory name from config values.

format_num_pairs

format_num_pairs(n: int) -> str

Format number of pairs for output directory name.
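A plausible sketch of how these two helpers fit together; the exact formatting scheme ("1M", "10k") and the directory-name layout are assumptions, not the project's confirmed output:

```python
def format_num_pairs_sketch(n: int) -> str:
    # Assumed compact formatting: 1_000_000 -> "1M", 10_000 -> "10k".
    if n % 1_000_000 == 0:
        return f"{n // 1_000_000}M"
    if n % 1_000 == 0:
        return f"{n // 1_000}k"
    return str(n)

def build_output_dir_name_sketch(model: str, loss: str,
                                 num_pairs: int, balanced: bool) -> str:
    # Assumed layout: join config values with underscores.
    suffix = "balanced" if balanced else "weighted"
    return f"{model}_{loss}_{format_num_pairs_sketch(num_pairs)}_{suffix}"

build_output_dir_name_sketch("minilm", "mnrl", 1_000_000, True)
# -> "minilm_mnrl_1M_balanced"
```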

mlops_project.visualize

CLI tool for visualizing paper embeddings as a 2D scatter plot.

create_scatter_plot

create_scatter_plot(
    coords: ndarray,
    subjects: list[str],
    output_path: Path,
    figsize: int = 16,
    dpi: int = 300,
    point_size: float = 1.5,
    alpha: float = 0.8,
    min_cluster_size_for_label: int = 150,
) -> None

Create and save a scatter plot of embeddings colored by subject.

Parameters:

- coords (ndarray, required): 2D coordinates, shape (n_samples, 2).
- subjects (list[str], required): Subject labels for each point.
- output_path (Path, required): Path to save the output image.
- figsize (int, default 16): Figure size (square).
- dpi (int, default 300): Output image DPI.
- point_size (float, default 1.5): Size of scatter points.
- alpha (float, default 0.8): Transparency of points.
- min_cluster_size_for_label (int, default 150): Only label clusters with at least this many points.

get_clean_label

get_clean_label(subject: str) -> str

Extract a clean label from a subject string, e.g. 'Machine Learning (cs.LG)' -> 'Machine Learning'.
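The described behavior amounts to stripping a trailing parenthesized category code. A minimal sketch of one way to do it (the regex approach is an assumption, not the project's implementation):

```python
import re

def get_clean_label_sketch(subject: str) -> str:
    # Drop a trailing "(...)" group, e.g. the arXiv category code.
    return re.sub(r"\s*\([^)]*\)\s*$", "", subject)

get_clean_label_sketch("Machine Learning (cs.LG)")  # -> "Machine Learning"
```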

load_metadata_subjects

load_metadata_subjects(index_dir: Path) -> list[str]

Load primary_subject for each paper from metadata.

Parameters:

- index_dir (Path, required): Directory containing metadata.json or metadata.db.

Returns:

- list[str]: List of primary_subject strings, one per paper.

reduce_dimensions

reduce_dimensions(
    embeddings: ndarray,
    method: str = "umap",
    n_neighbors: int = 15,
    min_dist: float = 0.1,
    random_state: int = 42,
) -> np.ndarray

Reduce embeddings to 2D using UMAP or t-SNE.

Parameters:

- embeddings (ndarray, required): Array of shape (n_samples, embedding_dim).
- method (str, default 'umap'): "umap" or "tsne".
- n_neighbors (int, default 15): UMAP n_neighbors parameter.
- min_dist (float, default 0.1): UMAP min_dist parameter.
- random_state (int, default 42): Random seed for reproducibility.

Returns:

- ndarray: Array of shape (n_samples, 2).

stratified_sample_indices

stratified_sample_indices(
    subjects: list[str], sample_size: int, rng: Generator
) -> np.ndarray

Sample indices with stratification by subject.

Parameters:

- subjects (list[str], required): List of subject labels.
- sample_size (int, required): Total number of samples to draw.
- rng (Generator, required): NumPy random generator.

Returns:

- ndarray: Array of sampled indices.
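Stratified sampling here means drawing from each subject in proportion to its size, so small subjects still appear in the sample. A sketch under that assumption (the quota scheme below, including the minimum of one draw per subject, is illustrative rather than the project's exact policy):

```python
import numpy as np

def stratified_sample_indices_sketch(subjects, sample_size, rng):
    """Sketch: sample indices proportionally to subject frequency,
    without replacement within each subject."""
    subjects = np.asarray(subjects)
    labels, counts = np.unique(subjects, return_counts=True)
    # Proportional quota per subject, with at least one draw each.
    quotas = np.maximum(1, (counts / counts.sum() * sample_size).astype(int))
    picked = []
    for label, quota in zip(labels, quotas):
        idx = np.flatnonzero(subjects == label)
        picked.append(rng.choice(idx, size=min(quota, idx.size),
                                 replace=False))
    return np.concatenate(picked)
```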

visualize

visualize(
    index_dir: str = typer.Option(
        "data/faiss",
        "--index-dir",
        "-i",
        help="Path to FAISS index directory.",
    ),
    output: str = typer.Option(
        "embedding_landscape.png",
        "--output",
        "-o",
        help="Output image path.",
    ),
    sample_size: int = typer.Option(
        50000,
        "--sample-size",
        "-n",
        help="Number of points to visualize.",
    ),
    method: str = typer.Option(
        "umap",
        "--method",
        "-m",
        help="Dimensionality reduction method: 'umap' or 'tsne'.",
    ),
    n_neighbors: int = typer.Option(
        15,
        "--n-neighbors",
        help="UMAP n_neighbors parameter.",
    ),
    min_dist: float = typer.Option(
        0.1, "--min-dist", help="UMAP min_dist parameter."
    ),
    dpi: int = typer.Option(
        300, "--dpi", help="Output image DPI."
    ),
    figsize: int = typer.Option(
        16, "--figsize", help="Figure size (square)."
    ),
    seed: int = typer.Option(
        42,
        "--seed",
        help="Random seed for reproducibility.",
    ),
) -> None

Generate a 2D scatter plot visualization of paper embeddings.

API module

The FastAPI module loads the model at import time, which prevents automatic documentation generation. See API and deployment for endpoint and runtime details.