---
title: "Advanced: Submitting and Polling Multiple Batches"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced: Submitting and Polling Multiple Batches}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# 1. Overview

This vignette demonstrates how to use **pairwiseLLM** for **Batch API workflows** (server-side batching), which are distinct from the live API calls described in the [Getting Started](https://shmercer.github.io/pairwiseLLM/articles/getting-started.html) vignette.

Batch workflows are ideal for large-scale jobs because they:

- Allow submitting thousands of pairs at once
- Are often cheaper (e.g., discounted batch pricing on some providers)
- Avoid client-side timeout and connection issues
- Can be polled and resumed even if your local R session ends

Supported Batch API providers:

- **OpenAI** (batch pipeline: `run_openai_batch_pipeline()`)
- **Anthropic** (batch pipeline: `run_anthropic_batch_pipeline()`)
- **Gemini** (batch pipeline: `run_gemini_batch_pipeline()`)

> **Recommended approach:** For *multiple* batches (e.g., templates × providers × models × forward/reverse), use:
>
> - `llm_submit_pairs_multi_batch()` to **split + submit** jobs (no polling; writes an optional registry CSV)
> - `llm_resume_multi_batches()` to **poll + download + parse** results (can resume from a registry on disk)
>
> These helpers orchestrate the provider-specific pipelines without forcing you to write your own polling loops.

> **Note:** **Together.ai** and **Ollama** do not currently support a native Batch API compatible with this workflow. For those providers, use the **live** API wrapper `submit_llm_pairs()` as described in the [Getting Started](https://shmercer.github.io/pairwiseLLM/articles/getting-started.html) vignette.

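
To make the "split + submit" step more concrete, the chunk below is a minimal base-R sketch of how a pairs table could be divided into segments before each segment is submitted as its own batch. The helper `split_into_segments()` and the toy `pairs_demo` table (including its column names) are hypothetical illustrations and not part of **pairwiseLLM**; the package's own splitting logic may differ.

```{r}
# Hypothetical helper (not part of pairwiseLLM): divide the rows of a pairs
# table into contiguous segments, either with a maximum number of rows per
# segment (batch_size) or with a target number of segments (n_segments).
split_into_segments <- function(pairs, n_segments = NULL, batch_size = NULL) {
  n <- nrow(pairs)
  if (!is.null(batch_size)) {
    # Rows 1..batch_size form segment 1, the next batch_size rows segment 2, ...
    seg <- ceiling(seq_len(n) / batch_size)
  } else if (!is.null(n_segments)) {
    # Spread rows as evenly as possible over n_segments contiguous groups
    seg <- ceiling(seq_len(n) * n_segments / n)
  } else {
    stop("Supply either `n_segments` or `batch_size`.")
  }
  unname(split(pairs, seg))
}

# Toy pairs table with made-up IDs, used only for illustration
pairs_demo <- data.frame(
  left_id  = paste0("S", 1:10),
  right_id = paste0("S", 2:11)
)

segments <- split_into_segments(pairs_demo, batch_size = 4)
vapply(segments, nrow, integer(1))
```

With ten rows and `batch_size = 4`, this yields three segments of 4, 4, and 2 rows. In the multi-batch workflow, each such segment would correspond to one server-side batch job.
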
In this vignette, we will cover:

- Designing a grid of provider/model/thinking/direction combinations
- Submitting *many* batch jobs using the multi-batch helpers
- Polling and resuming safely via on-disk registries
- Producing per-run and merged results tables

> Note: All heavy API calls in this vignette are set to `eval = FALSE` so that the vignette remains CRAN-safe. You can enable them in your own project.

For basic function usage, see the companion vignette:

* [`vignette("getting-started")`](https://shmercer.github.io/pairwiseLLM/articles/getting-started.html)

For prompt evaluation and positional-bias diagnostics, see the companion vignette:

* [`vignette("prompt-template-bias")`](https://shmercer.github.io/pairwiseLLM/articles/prompt-template-bias.html)

# 2. Setup and API Keys

```{r setup, message=FALSE}
library(pairwiseLLM)
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
library(stringr)
```

Required environment variables:

| Provider  | Environment Variable |
|-----------|----------------------|
| OpenAI    | `OPENAI_API_KEY`     |
| Anthropic | `ANTHROPIC_API_KEY`  |
| Gemini    | `GEMINI_API_KEY`     |

Check which are set:

```{r}
check_llm_api_keys()
```

# 3. Example Data and Prompt Template

We use the built-in writing samples and a single trait (`overall_quality`).

```{r}
data("example_writing_samples", package = "pairwiseLLM")

td <- trait_description("overall_quality")
td
```

Default prompt template:

```{r}
tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 400), "...\n")
```

Construct a modest number of pairs to keep the example light:

```{r}
set.seed(123)

pairs_all <- example_writing_samples |>
  make_pairs()

n_pairs <- min(40L, nrow(pairs_all))

pairs_forward <- pairs_all |>
  sample_pairs(n_pairs = n_pairs, seed = 123) |>
  randomize_pair_order(seed = 456)

pairs_reverse <- sample_reverse_pairs(
  pairs_forward,
  reverse_pct = 1.0,
  seed = 789
)

get_pairs_for_direction <- function(direction = c("forward", "reverse")) {
  direction <- match.arg(direction)
  if (identical(direction, "forward")) {
    pairs_forward
  } else {
    pairs_reverse
  }
}
```

# 4. Designing the Batch Grid

Suppose we want to test several prompt templates across:

- Anthropic models (with/without “thinking”)
- OpenAI models (with/without “thinking” for specific models)
- Gemini models (with “thinking” enabled)

Here we define a small grid:

```{r}
anthropic_models <- c(
  "claude-sonnet-4-5",
  "claude-haiku-4-5",
  "claude-opus-4-5"
)

gemini_models <- c(
  "gemini-3-pro-preview"
)

openai_models <- c(
  "gpt-4.1",
  "gpt-4o",
  "gpt-5.1"
)

thinking_levels <- c("no_thinking", "with_thinking")
directions <- c("forward", "reverse")

anthropic_grid <- tidyr::expand_grid(
  provider = "anthropic",
  model = anthropic_models,
  thinking = thinking_levels,
  direction = directions
)

gemini_grid <- tidyr::expand_grid(
  provider = "gemini",
  model = gemini_models,
  thinking = "with_thinking",
  direction = directions
)

openai_grid <- tidyr::expand_grid(
  provider = "openai",
  model = openai_models,
  thinking = thinking_levels,
  direction = directions
) |>
  # For example, only allow "with_thinking" for gpt-5.1
  dplyr::filter(model == "gpt-5.1" | thinking == "no_thinking")

batch_grid <- dplyr::bind_rows(
  anthropic_grid,
  gemini_grid,
  openai_grid
)

batch_grid
```

We will also imagine multiple prompt templates have been registered.
For simplicity, we use the same `tmpl` string, but in practice you would substitute different text:

```{r}
templates_tbl <- tibble::tibble(
  template_id = c("test1", "test2", "test3", "test4", "test5"),
  prompt_template = list(tmpl, tmpl, tmpl, tmpl, tmpl)
)

templates_tbl
```

# 5. Submitting Many Batches with the Multi‑Batch Helpers

The key idea is:

- Each **combination** of `(template_id, provider, model, thinking, direction)` becomes a **run**
- Each run writes its files into its own subdirectory (so file names never collide)
- Within each run you can still split into multiple segments using `batch_size` or `n_segments`

## 5.1 Create a run plan and output directory

```{r, eval=FALSE}
out_root <- "dev-output/advanced-multi-batch"
dir.create(out_root, recursive = TRUE, showWarnings = FALSE)

run_plan <- tidyr::crossing(
  templates_tbl |> tidyr::unnest(prompt_template),
  batch_grid
) |>
  mutate(
    run_id = paste(template_id, provider, model, thinking, direction, sep = "__"),
    run_id = gsub("[^A-Za-z0-9_.-]+", "-", run_id),
    run_dir = file.path(out_root, run_id)
  )

run_plan |>
  dplyr::select(run_id, template_id, provider, model, thinking, direction, run_dir)
```

## 5.2 Submit all runs (no polling)

Below we submit each run using `llm_submit_pairs_multi_batch()`. This returns a `jobs` list and writes a `jobs_registry.csv` under each run directory (because `write_registry = TRUE`).

Provider-specific options can be forwarded via `...`. In the example below we:

- Enable “thinking” output where applicable
- Ask providers to include raw outputs in addition to parsed tags (helpful for debugging)

```{r, eval=FALSE}
submit_one_run <- function(template_id, prompt_template, provider, model,
                           thinking, direction, run_dir) {
  pairs_use <- get_pairs_for_direction(direction)
  is_thinking <- identical(thinking, "with_thinking")

  # Provider-specific knobs (passed through via ...)
  extra_args <- list()

  if (identical(provider, "openai")) {
    # Only request thoughts for models that support them in this workflow
    extra_args$include_thoughts <- is_thinking && grepl("^gpt-5\\.1", model)
    extra_args$include_raw <- TRUE
  } else if (identical(provider, "anthropic")) {
    extra_args$reasoning <- if (is_thinking) "enabled" else "none"
    extra_args$include_thoughts <- is_thinking
    extra_args$include_raw <- TRUE
    # Optional: set deterministic temperature when not using reasoning
    if (!is_thinking) extra_args$temperature <- 0
  } else if (identical(provider, "gemini")) {
    extra_args$include_thoughts <- TRUE
    extra_args$thinking_level <- "low" # example
    extra_args$include_raw <- TRUE
  }

  message(
    "Submitting: ", template_id, " | ",
    provider, " / ", model, " / ", thinking, " / ", direction
  )

  # Split strategy:
  # - For real jobs, use batch_size (e.g., 500–5000) or n_segments (e.g., 10–50)
  # - Here we keep it simple and submit a single segment per run
  do.call(
    llm_submit_pairs_multi_batch,
    c(
      list(
        pairs = pairs_use,
        backend = provider,
        model = model,
        trait_name = td$name,
        trait_description = td$description,
        prompt_template = prompt_template,
        n_segments = 1L,
        output_dir = run_dir,
        write_registry = TRUE,
        verbose = TRUE
      ),
      extra_args
    )
  )
}

# pmap() passes every column as a named argument, so drop run_id,
# which submit_one_run() does not accept
run_results <- purrr::pmap(
  dplyr::select(run_plan, -run_id),
  submit_one_run
)

# Store a lightweight manifest so you can resume later without rebuilding run_plan
manifest <- run_plan |>
  mutate(registry_path = file.path(run_dir, "jobs_registry.csv"))

manifest_path <- file.path(out_root, "run_manifest.csv")
readr::write_csv(manifest, manifest_path)

manifest_path
```

At this point, each run directory contains:

- JSONL input/output placeholders (one per segment)
- A `jobs_registry.csv` that records all batch IDs and file paths for that run

You can safely stop R or restart your machine after submission.

# 6. Polling, Downloading, and Parsing (Resumable)

To poll all runs, read the manifest and call `llm_resume_multi_batches()` for each `run_dir`.

If you restart R, you can resume **without** keeping the `jobs` objects in memory by setting `jobs = NULL` and pointing to `output_dir` (the function will load `jobs_registry.csv`).

```{r, eval=FALSE}
manifest_path <- file.path(out_root, "run_manifest.csv")
manifest <- readr::read_csv(manifest_path, show_col_types = FALSE)

poll_one_run <- function(run_dir) {
  llm_resume_multi_batches(
    jobs = NULL, # load from jobs_registry.csv in run_dir
    output_dir = run_dir,
    interval_seconds = 60,
    per_job_delay = 2,
    write_results_csv = TRUE, # writes batch_XX_results.csv files
    write_registry = TRUE, # refreshes jobs_registry.csv with done flags
    keep_jsonl = TRUE,
    verbose = TRUE,
    write_combined_csv = TRUE, # writes combined_results.csv inside run_dir
    combined_csv_path = "combined_results.csv"
  )
}

polled <- purrr::map(manifest$run_dir, poll_one_run)
```

## 6.1 Building a single merged results table (all runs)

Each element of `polled` contains a `combined` tibble for that run (i.e., all segments bound together). We can attach run metadata (template/provider/model/thinking/direction) and then bind all runs into one master table.

```{r, eval=FALSE}
combined_all <- purrr::map2_dfr(
  polled,
  seq_len(nrow(manifest)),
  function(res, i) {
    meta <- manifest[i, ]
    if (is.null(res$combined)) return(NULL)

    res$combined |>
      mutate(
        template_id = meta$template_id,
        provider = meta$provider,
        model = meta$model,
        thinking = meta$thinking,
        direction = meta$direction,
        run_id = meta$run_id
      )
  }
)

combined_path <- file.path(out_root, "combined_all_runs.csv")
readr::write_csv(combined_all, combined_path)

combined_path
```

# 7. Resuming After Interruption

An interrupted workflow can be resumed because:

- Submission writes `jobs_registry.csv` under each run directory
- Polling can be restarted at any time by calling `llm_resume_multi_batches(jobs = NULL, output_dir = run_dir)`
- If you keep a `run_manifest.csv` with `run_dir` paths, resuming *all* runs is just a loop

Example: resume only unfinished runs (based on each run’s registry):

```{r, eval=FALSE}
manifest <- readr::read_csv(
  file.path(out_root, "run_manifest.csv"),
  show_col_types = FALSE
)

# A run needs polling if its registry exists and any job is not yet done
needs_poll <- function(run_dir) {
  reg_path <- file.path(run_dir, "jobs_registry.csv")
  if (!file.exists(reg_path)) return(FALSE)
  reg <- readr::read_csv(reg_path, show_col_types = FALSE)
  any(!as.logical(reg$done))
}

unfinished_dirs <- manifest$run_dir[vapply(manifest$run_dir, needs_poll, logical(1))]

# poll_one_run() is the helper defined in Section 6
polled <- purrr::map(unfinished_dirs, poll_one_run)
```

# 8. Next Steps

Once you have per-run results CSVs (e.g., one per template × model × thinking × direction), you can:

- Compute **reverse consistency** with `compute_reverse_consistency()`
- Analyze **positional bias** with `check_positional_bias()`
- Aggregate results by provider/model/template using standard `dplyr` pipelines
- Fit **Bradley–Terry** models with `build_bt_data()` + `fit_bt_model()`
- Fit **Elo** models with `fit_elo_model()` (when `EloChoice` is installed)

# 9. Citation

> Mercer, S. H. (2025). *Advanced: Submitting and polling multiple batches* [R package vignette]. In *pairwiseLLM: Pairwise comparison tools for large language model-based writing evaluation*. https://doi.org/10.32614/CRAN.package.pairwiseLLM