Python API¶
This page contains the auto-generated API reference for the Python modules in src/mlops_project.
Modules¶
mlops_project.data¶
ArxivPapersDataset ¶
create_contrastive_pairs ¶
```python
create_contrastive_pairs(
    dataset,
    num_pairs: int = 100000,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
)
```
Create positive and negative pairs for ContrastiveLoss.
Returns a dataset with columns: sentence1, sentence2, label.

- label=1.0 for positive pairs (same subject)
- label=0.0 for negative pairs (different subjects)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| balanced | bool | If True, each subject has equal probability of being chosen. If False, subjects are weighted by their frequency in the dataset. | True |
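The pairing logic can be sketched in plain Python. This is an illustrative stand-in, not the project's implementation (which operates on a Hugging Face dataset); the field names primary_subject and abstract are assumed from the surrounding docs:

```python
import random
from collections import defaultdict


def make_contrastive_pairs(records, num_pairs, seed=42, balanced=True):
    """Sketch: build labelled pairs for ContrastiveLoss from dict records."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for rec in records:
        by_subject[rec["primary_subject"]].append(rec["abstract"])
    # Only subjects with at least two abstracts can form a positive pair;
    # at least two such subjects are needed for negative pairs.
    subjects = [s for s, texts in by_subject.items() if len(texts) >= 2]
    weights = None if balanced else [len(by_subject[s]) for s in subjects]
    pairs = []
    for i in range(num_pairs):
        if i % 2 == 0:
            # Positive pair: two different abstracts, same subject.
            (s,) = rng.choices(subjects, weights=weights)
            a, b = rng.sample(by_subject[s], 2)
            pairs.append({"sentence1": a, "sentence2": b, "label": 1.0})
        else:
            # Negative pair: one abstract from each of two subjects.
            s1, s2 = rng.sample(subjects, 2)
            pairs.append({
                "sentence1": rng.choice(by_subject[s1]),
                "sentence2": rng.choice(by_subject[s2]),
                "label": 0.0,
            })
    return pairs
```

With balanced=False, the weights make frequent subjects proportionally more likely to be sampled, matching the parameter description above.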
create_pairs ¶
```python
create_pairs(
    dataset,
    pair_fn: Callable,
    save_path: Path,
    num_pairs: int,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
) -> Dataset
```
Create and save pairs to disk.
create_positive_pairs ¶
```python
create_positive_pairs(
    dataset,
    num_pairs: int = 100000,
    text_field: str = "abstract",
    seed: int = 42,
    balanced: bool = True,
)
```
Create positive pairs for MultipleNegativesRankingLoss.
Returns a dataset with columns: anchor, positive. Each pair contains two abstracts from papers with the same primary_subject. MNRL will use in-batch negatives automatically.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| balanced | bool | If True, each subject has equal probability of being chosen. If False, subjects are weighted by their frequency in the dataset. | True |
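Because MNRL treats other in-batch examples as negatives, only (anchor, positive) pairs are needed. A minimal sketch (again assuming dict records with primary_subject and abstract; not the project's implementation):

```python
import random
from collections import defaultdict


def make_anchor_positive_pairs(records, num_pairs, seed=42):
    """Sketch: sample (anchor, positive) pairs from the same subject."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for rec in records:
        by_subject[rec["primary_subject"]].append(rec["abstract"])
    subjects = [s for s, texts in by_subject.items() if len(texts) >= 2]
    pairs = []
    for _ in range(num_pairs):
        s = rng.choice(subjects)
        # Two distinct abstracts from the same subject.
        anchor, positive = rng.sample(by_subject[s], 2)
        pairs.append({"anchor": anchor, "positive": positive})
    return pairs
```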
ensure_data_exists ¶
```python
ensure_data_exists(
    data_dir: Path,
    dataset_config: DictConfig | ListConfig | None = None,
) -> None
```
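One plausible shape for the "config has changed" check is comparing a hash of the current config with one stored alongside the data. This is a hypothetical sketch of the decision, not the project's actual mechanism:

```python
import hashlib
import json
from pathlib import Path


def needs_preprocessing(data_dir: Path, config: dict) -> bool:
    """Sketch: rerun preprocessing if data is missing or the config hash differs."""
    digest = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()
    marker = data_dir / "config.sha256"  # hypothetical marker file name
    if not data_dir.exists() or not marker.exists():
        return True
    return marker.read_text().strip() != digest
```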
Run preprocessing if the required data doesn't exist or the config has changed.
preprocess ¶
```python
preprocess(
    loss: LossType = LossType.MultipleNegativesRankingLoss,
    output_folder: Path = Path("data"),
    test_size: float = 0.2,
    number_of_pairs: int = 1000000,
    number_of_eval_pairs: int = 10000,
    seed: int = 42,
    source: str = "nick007x/arxiv-papers",
    columns: list[str] | None = None,
    text_field: str = "abstract",
    balanced: bool = True,
) -> None
```
Download and preprocess the arXiv papers dataset.
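The interaction of test_size and seed can be illustrated with a reproducible split. This is assumed behaviour shown for orientation only; the real split is performed by the datasets library, not this helper:

```python
import random


def split_indices(n, test_size=0.2, seed=42):
    """Sketch: reproducible train/test index split (hypothetical helper)."""
    ids = list(range(n))
    random.Random(seed).shuffle(ids)
    cut = int(n * test_size)
    return ids[cut:], ids[:cut]  # (train, test)
```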
preprocess_hydra ¶
Hydra entry point for preprocessing.
mlops_project.model¶
mlops_project.evaluate¶
create_ir_evaluator ¶
Create an Information Retrieval evaluator for precision@k metrics.
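For reference, precision@k is the fraction of the top-k retrieved documents that are relevant. A minimal illustration of the metric (not the evaluator's internals, which come from sentence-transformers):

```python
def precision_at_k(retrieved, relevant, k):
    """Sketch: fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k
```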
mlops_project.train¶
mlops_project.utils¶
mlops_project.visualize¶
CLI tool for visualizing paper embeddings as a 2D scatter plot.
create_scatter_plot ¶
```python
create_scatter_plot(
    coords: ndarray,
    subjects: list[str],
    output_path: Path,
    figsize: int = 16,
    dpi: int = 300,
    point_size: float = 1.5,
    alpha: float = 0.8,
    min_cluster_size_for_label: int = 150,
) -> None
```
Create and save a scatter plot of embeddings colored by subject.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| coords | ndarray | 2D coordinates, shape (n_samples, 2). | required |
| subjects | list[str] | Subject labels for each point. | required |
| output_path | Path | Path to save the output image. | required |
| figsize | int | Figure size (square). | 16 |
| dpi | int | Output image DPI. | 300 |
| point_size | float | Size of scatter points. | 1.5 |
| alpha | float | Transparency of points. | 0.8 |
| min_cluster_size_for_label | int | Only label clusters with at least this many points. | 150 |
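The min_cluster_size_for_label rule can be sketched as follows. The median placement is an assumption for illustration; only the size threshold is documented above:

```python
import numpy as np


def label_positions(coords, subjects, min_cluster_size=150):
    """Sketch: label only subjects with >= min_cluster_size points."""
    labels = {}
    subjects = np.asarray(subjects)
    for subject in np.unique(subjects):
        mask = subjects == subject
        if mask.sum() >= min_cluster_size:
            # Place the label at the cluster's per-axis median (assumed).
            labels[subject] = np.median(coords[mask], axis=0)
    return labels
```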
get_clean_label ¶
Extract clean label from subject string like 'Machine Learning (cs.LG)' -> 'Machine Learning'.
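One way to implement this cleanup is stripping a trailing parenthesised category code; a sketch, not necessarily the function's actual implementation:

```python
import re


def clean_label(subject: str) -> str:
    """Sketch: drop a trailing '(cs.LG)'-style category code."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", subject)
```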
load_metadata_subjects ¶
Load primary_subject for each paper from metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index_dir | Path | Directory containing metadata.json or metadata.db. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | List of primary_subject strings, one per paper. |
reduce_dimensions ¶
```python
reduce_dimensions(
    embeddings: ndarray,
    method: str = "umap",
    n_neighbors: int = 15,
    min_dist: float = 0.1,
    random_state: int = 42,
) -> np.ndarray
```
Reduce embeddings to 2D using UMAP or t-SNE.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| embeddings | ndarray | Array of shape (n_samples, embedding_dim). | required |
| method | str | "umap" or "tsne". | 'umap' |
| n_neighbors | int | UMAP n_neighbors parameter. | 15 |
| min_dist | float | UMAP min_dist parameter. | 0.1 |
| random_state | int | Random seed for reproducibility. | 42 |
Returns:
| Type | Description |
|---|---|
| ndarray | Array of shape (n_samples, 2). |
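The shape contract, (n_samples, embedding_dim) in and (n_samples, 2) out, can be demonstrated with a plain PCA projection. This stand-in is for illustration only; the real function uses UMAP or t-SNE:

```python
import numpy as np


def reduce_to_2d_pca(embeddings):
    """Sketch: project embeddings onto their top two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```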
stratified_sample_indices ¶
Sample indices with stratification by subject.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subjects | list[str] | List of subject labels. | required |
| sample_size | int | Total number of samples to draw. | required |
| rng | Generator | NumPy random generator. | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Array of sampled indices. |
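A minimal sketch of stratified sampling, assuming proportional allocation per subject (the exact allocation strategy is not documented above):

```python
import numpy as np


def stratified_indices(subjects, sample_size, rng):
    """Sketch: sample indices proportionally from each subject."""
    subjects = np.asarray(subjects)
    uniques, counts = np.unique(subjects, return_counts=True)
    out = []
    for subject, count in zip(uniques, counts):
        # At least one sample per subject (assumed), else proportional share.
        k = max(1, round(sample_size * count / len(subjects)))
        idx = np.flatnonzero(subjects == subject)
        out.append(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return np.concatenate(out)
```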
visualize ¶
```python
visualize(
    index_dir: str = typer.Option(
        "data/faiss",
        "--index-dir",
        "-i",
        help="Path to FAISS index directory.",
    ),
    output: str = typer.Option(
        "embedding_landscape.png",
        "--output",
        "-o",
        help="Output image path.",
    ),
    sample_size: int = typer.Option(
        50000,
        "--sample-size",
        "-n",
        help="Number of points to visualize.",
    ),
    method: str = typer.Option(
        "umap",
        "--method",
        "-m",
        help="Dimensionality reduction method: 'umap' or 'tsne'.",
    ),
    n_neighbors: int = typer.Option(
        15,
        "--n-neighbors",
        help="UMAP n_neighbors parameter.",
    ),
    min_dist: float = typer.Option(
        0.1, "--min-dist", help="UMAP min_dist parameter."
    ),
    dpi: int = typer.Option(
        300, "--dpi", help="Output image DPI."
    ),
    figsize: int = typer.Option(
        16, "--figsize", help="Figure size (square)."
    ),
    seed: int = typer.Option(
        42,
        "--seed",
        help="Random seed for reproducibility.",
    ),
) -> None
```
Generate a 2D scatter plot visualization of paper embeddings.
API module¶
The FastAPI module loads the model at import time, which makes it unsuitable for automatic API documentation. See API and deployment for endpoint and runtime details.