---
title: "Advanced: Submitting and Polling Multiple Batches"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced: Submitting and Polling Multiple Batches}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
```

# 1. Overview

This vignette demonstrates how to use **pairwiseLLM** for **Batch API workflows** (server-side batching), which are distinct from the live API calls described in the [Getting Started](https://shmercer.github.io/pairwiseLLM/articles/getting-started.html) vignette.

Batch workflows are ideal for large-scale jobs because they:

- Allow submitting thousands of pairs at once  
- Are often cheaper (e.g., discounted batch pricing on some providers)  
- Avoid client-side timeout and connection issues  
- Can be polled and resumed even if your local R session ends  

Supported Batch API providers:

- **OpenAI** (batch pipeline: `run_openai_batch_pipeline()`)
- **Anthropic** (batch pipeline: `run_anthropic_batch_pipeline()`)
- **Gemini** (batch pipeline: `run_gemini_batch_pipeline()`)

> **Recommended approach:** For *multiple* batches (e.g., templates × providers × models × forward/reverse), use:
> 
> - `llm_submit_pairs_multi_batch()` to **split + submit** jobs (no polling; writes an optional registry CSV)
> - `llm_resume_multi_batches()` to **poll + download + parse** results (can resume from a registry on disk)
>
> These helpers orchestrate the provider-specific pipelines without forcing you to write your own polling loops.

> **Note:** **Together.ai** and **Ollama** do not currently support a native Batch API compatible with this workflow. For those providers, use the **live** API wrapper `submit_llm_pairs()` as described in the [Getting Started](https://shmercer.github.io/pairwiseLLM/articles/getting-started.html) vignette.

In this vignette, we will cover:

- Designing a grid of provider/model/thinking/direction combinations
- Submitting *many* batch jobs using the multi-batch helpers
- Polling and resuming safely via on-disk registries
- Producing per-run and merged results tables

> **Note:** All heavy API calls in this vignette are set to `eval = FALSE` so that the vignette remains CRAN-safe. You can enable them in your own project.

For basic function usage, see the companion vignette:

* [`vignette("getting-started")`](https://shmercer.github.io/pairwiseLLM/articles/getting-started.html)

For prompt evaluation and positional-bias diagnostics, see the companion vignette:

* [`vignette("prompt-template-bias")`](https://shmercer.github.io/pairwiseLLM/articles/prompt-template-bias.html)

# 2. Setup and API Keys

```{r setup, message=FALSE}
library(pairwiseLLM)
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
library(stringr)
```

Required environment variables:

| Provider  | Environment Variable |
|-----------|----------------------|
| OpenAI    | `OPENAI_API_KEY`     |
| Anthropic | `ANTHROPIC_API_KEY`  |
| Gemini    | `GEMINI_API_KEY`     |

Check which are set:

```{r}
check_llm_api_keys()
```
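
If you want to see what such a check amounts to, the following base R sketch tests whether each variable from the table above is non-empty. It is only an illustration and not the implementation of `check_llm_api_keys()`; the vector name `required_keys` is introduced here for the example.

```r
# Environment variable names taken from the table above
required_keys <- c(
  openai    = "OPENAI_API_KEY",
  anthropic = "ANTHROPIC_API_KEY",
  gemini    = "GEMINI_API_KEY"
)

# Sys.getenv() returns "" for unset variables, so nzchar() flags the ones that are set
key_is_set <- nzchar(Sys.getenv(required_keys))
names(key_is_set) <- names(required_keys)

key_is_set
```

Keys are typically set in `~/.Renviron` or with `Sys.setenv()` before any API call is made.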

# 3. Example Data and Prompt Template

We use the built-in writing samples and a single trait (`overall_quality`).

```{r}
data("example_writing_samples", package = "pairwiseLLM")

td <- trait_description("overall_quality")
td
```

Default prompt template:

```{r}
tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 400), "...\n")
```

Construct a modest number of pairs to keep the example light:

```{r}
set.seed(123)

pairs_all <- example_writing_samples |>
  make_pairs()

n_pairs <- min(40L, nrow(pairs_all))

pairs_forward <- pairs_all |>
  sample_pairs(n_pairs = n_pairs, seed = 123) |>
  randomize_pair_order(seed = 456)

pairs_reverse <- sample_reverse_pairs(
  pairs_forward,
  reverse_pct = 1.0,
  seed        = 789
)

get_pairs_for_direction <- function(direction = c("forward", "reverse")) {
  direction <- match.arg(direction)
  if (identical(direction, "forward")) {
    pairs_forward
  } else {
    pairs_reverse
  }
}
```

# 4. Designing the Batch Grid

Suppose we want to test several prompt templates across:

- Anthropic models (with/without “thinking”)
- OpenAI models (with/without “thinking” for specific models)
- Gemini models (with “thinking” enabled)

Here we define a small grid:

```{r}
anthropic_models <- c(
  "claude-sonnet-4-5",
  "claude-haiku-4-5",
  "claude-opus-4-5"
)

gemini_models <- c(
  "gemini-3-pro-preview"
)

openai_models <- c(
  "gpt-4.1",
  "gpt-4o",
  "gpt-5.1"
)

thinking_levels <- c("no_thinking", "with_thinking")
directions <- c("forward", "reverse")

anthropic_grid <- tidyr::expand_grid(
  provider  = "anthropic",
  model     = anthropic_models,
  thinking  = thinking_levels,
  direction = directions
)

gemini_grid <- tidyr::expand_grid(
  provider  = "gemini",
  model     = gemini_models,
  thinking  = "with_thinking",
  direction = directions
)

openai_grid <- tidyr::expand_grid(
  provider  = "openai",
  model     = openai_models,
  thinking  = thinking_levels,
  direction = directions
) |>
  # For example, only allow "with_thinking" for gpt-5.1
  dplyr::filter(model == "gpt-5.1" | thinking == "no_thinking")

batch_grid <- dplyr::bind_rows(
  anthropic_grid,
  gemini_grid,
  openai_grid
)

batch_grid
```

We also assume that several prompt templates are available. For simplicity,
every entry below reuses the same `tmpl` string; in practice, each
`template_id` would map to different prompt text:

```{r}
templates_tbl <- tibble::tibble(
  template_id     = c("test1", "test2", "test3", "test4", "test5"),
  prompt_template = list(tmpl, tmpl, tmpl, tmpl, tmpl)
)

templates_tbl
```
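
Before submitting anything, it can help to count how many runs the grid and the template table imply, since each run becomes at least one batch job. The sketch below redoes the arithmetic in base R using the model counts and the OpenAI filter from this section; the variable names (`n_anthropic`, `n_runs`, and so on) are introduced here for illustration only.

```r
thinking_levels <- c("no_thinking", "with_thinking")
directions      <- c("forward", "reverse")

# OpenAI grid with the same filter as above: thinking only for gpt-5.1
openai_grid <- expand.grid(
  model            = c("gpt-4.1", "gpt-4o", "gpt-5.1"),
  thinking         = thinking_levels,
  direction        = directions,
  stringsAsFactors = FALSE
)
openai_grid <- subset(openai_grid, model == "gpt-5.1" | thinking == "no_thinking")

n_anthropic <- 3L * length(thinking_levels) * length(directions)  # 12
n_gemini    <- 1L * 1L * length(directions)                       # 2
n_openai    <- nrow(openai_grid)                                  # 8

n_grid      <- n_anthropic + n_gemini + n_openai                  # 22
n_templates <- 5L
n_runs      <- n_grid * n_templates                               # 110

# With at most 40 pairs per run (n_pairs above), an upper bound on comparisons
max_comparisons <- n_runs * 40L                                   # 4400
c(runs = n_runs, max_comparisons = max_comparisons)
```

Even this small example produces 110 runs, which is why the per-run directories and registries described below matter.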

# 5. Submitting Many Batches with the Multi‑Batch Helpers

The key idea is:

- Each **combination** of `(template_id, provider, model, thinking, direction)` becomes a **run**
- Each run writes its files into its own subdirectory (so file names never collide)
- Within each run you can still split into multiple segments using `batch_size` or `n_segments`
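
To make the difference between `batch_size` and `n_segments` concrete, here is a minimal base R sketch of how a set of pair rows could be divided either way. The helper `split_into_segments()` is hypothetical and is not the package's internal splitting logic; it only illustrates the two strategies.

```r
# Hypothetical helper: split row indices by a fixed segment size
# or into a fixed number of roughly equal segments
split_into_segments <- function(n_rows, batch_size = NULL, n_segments = NULL) {
  # Exactly one of the two strategies must be supplied
  stopifnot(xor(is.null(batch_size), is.null(n_segments)))
  idx <- seq_len(n_rows)
  if (!is.null(batch_size)) {
    segment_id <- ceiling(idx / batch_size)
  } else {
    segment_id <- ceiling(idx * n_segments / n_rows)
  }
  split(idx, segment_id)
}

by_size  <- split_into_segments(40L, batch_size = 15L)
by_count <- split_into_segments(40L, n_segments = 4L)

lengths(by_size)   # segment sizes 15, 15, 10
lengths(by_count)  # four segments of 10
```

A fixed `batch_size` keeps each job under a known size, while `n_segments` fixes the number of jobs regardless of how many pairs there are.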

## 5.1 Create a run plan and output directory

```{r, eval=FALSE}
out_root <- "dev-output/advanced-multi-batch"
dir.create(out_root, recursive = TRUE, showWarnings = FALSE)

run_plan <- tidyr::crossing(
  templates_tbl |> tidyr::unnest(prompt_template),
  batch_grid
) |>
  mutate(
    run_id = paste(template_id, provider, model, thinking, direction, sep = "__"),
    run_id = gsub("[^A-Za-z0-9_.-]+", "-", run_id),
    run_dir = file.path(out_root, run_id)
  )

run_plan |> dplyr::select(run_id, template_id, provider, model, thinking, direction, run_dir)
```
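
The `gsub()` step matters because `run_id` is used as a directory name. The sketch below applies the same pattern to one identifier from the grid and to a made-up identifier containing characters that are awkward in paths; the wrapper `sanitize_run_id()` and the second model name are hypothetical.

```r
# Same pattern as in run_plan: keep letters, digits, "_", "." and "-",
# and collapse any run of other characters into a single "-"
sanitize_run_id <- function(x) gsub("[^A-Za-z0-9_.-]+", "-", x)

id_plain <- paste("test1", "openai", "gpt-4.1", "no_thinking", "forward", sep = "__")
id_messy <- paste("test1", "openai", "org/model:v1 beta", sep = "__")  # hypothetical model id

sanitize_run_id(id_plain)  # unchanged: every character is already allowed
sanitize_run_id(id_messy)  # "test1__openai__org-model-v1-beta"
```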

## 5.2 Submit all runs (no polling)

Below we submit each run using `llm_submit_pairs_multi_batch()`. This returns a `jobs` list and writes a `jobs_registry.csv` under each run directory (because `write_registry = TRUE`).

Provider-specific options can be forwarded via `...`. In the example below we:

- Enable “thinking” output where applicable
- Ask providers to include raw outputs in addition to parsed tags (helpful for debugging)

```{r, eval=FALSE}
submit_one_run <- function(template_id, prompt_template, provider, model, thinking, direction, run_dir) {
  pairs_use   <- get_pairs_for_direction(direction)
  is_thinking <- identical(thinking, "with_thinking")

  # Provider-specific knobs (passed through via ...)
  extra_args <- list()

  if (identical(provider, "openai")) {
    # Only request thoughts for models that support them in this workflow
    extra_args$include_thoughts <- is_thinking && grepl("^gpt-5\\.1", model)
    extra_args$include_raw      <- TRUE
  } else if (identical(provider, "anthropic")) {
    extra_args$reasoning        <- if (is_thinking) "enabled" else "none"
    extra_args$include_thoughts <- is_thinking
    extra_args$include_raw      <- TRUE
    # Optional: set deterministic temperature when not using reasoning
    if (!is_thinking) extra_args$temperature <- 0
  } else if (identical(provider, "gemini")) {
    extra_args$include_thoughts <- TRUE
    extra_args$thinking_level   <- "low"   # example
    extra_args$include_raw      <- TRUE
  }

  message(
    "Submitting: ", template_id, " | ", provider, " / ", model,
    " / ", thinking, " / ", direction
  )

  # Split strategy:
  # - For real jobs, use batch_size (e.g., 500–5000) or n_segments (e.g., 10–50)
  # - Here we keep it simple and submit a single segment per run
  do.call(
    llm_submit_pairs_multi_batch,
    c(
      list(
        pairs             = pairs_use,
        backend           = provider,
        model             = model,
        trait_name        = td$name,
        trait_description = td$description,
        prompt_template   = prompt_template,
        n_segments        = 1L,
        output_dir        = run_dir,
        write_registry    = TRUE,
        verbose           = TRUE
      ),
      extra_args
    )
  )
}

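run_results <- purrr::pmap(
  # pmap() passes every column as a named argument, so drop run_id,
  # which submit_one_run() does not accept
  dplyr::select(run_plan, -run_id),
  submit_one_run
)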

# Store a lightweight manifest so you can resume later without rebuilding run_plan
manifest <- run_plan |>
  mutate(registry_path = file.path(run_dir, "jobs_registry.csv"))

manifest_path <- file.path(out_root, "run_manifest.csv")
readr::write_csv(manifest, manifest_path)

manifest_path
```
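
The `do.call()` pattern in `submit_one_run()` may look unusual if you have not used it before: it builds one argument list from the fixed arguments plus the provider-specific extras, then calls the function with that list. The following sketch shows the mechanism with a stand-in function; `fake_submit()` is hypothetical and does not call any API.

```r
# Stand-in for a function that forwards extra options through `...`
fake_submit <- function(backend, model, ...) {
  list(backend = backend, model = model, options = list(...))
}

base_args  <- list(backend = "openai", model = "gpt-5.1")
extra_args <- list(include_thoughts = TRUE, include_raw = TRUE)

# c() concatenates the two lists; do.call() turns the result into named arguments
res <- do.call(fake_submit, c(base_args, extra_args))

names(res$options)  # "include_thoughts" "include_raw"
```

Because `extra_args` starts as an empty list, a provider with no special options simply contributes nothing to the call.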

At this point, each run directory contains:

- JSONL input/output placeholders (one per segment)
- A `jobs_registry.csv` that records all batch IDs and file paths for that run

You can safely stop R or restart your machine after submission.

# 6. Polling, Downloading, and Parsing (Resumable)

To poll all runs, read the manifest and call `llm_resume_multi_batches()` for each `run_dir`. If you restart R, you can resume **without** keeping the `jobs` objects in memory by setting `jobs = NULL` and pointing to `output_dir` (the function will load `jobs_registry.csv`).

```{r, eval=FALSE}
manifest_path <- file.path(out_root, "run_manifest.csv")
manifest <- readr::read_csv(manifest_path, show_col_types = FALSE)

poll_one_run <- function(run_dir) {
  llm_resume_multi_batches(
    jobs               = NULL,   # load from jobs_registry.csv in run_dir
    output_dir         = run_dir,
    interval_seconds   = 60,
    per_job_delay      = 2,
    write_results_csv  = TRUE,   # writes batch_XX_results.csv files
    write_registry     = TRUE,   # refreshes jobs_registry.csv with done flags
    keep_jsonl         = TRUE,
    verbose            = TRUE,
    write_combined_csv = TRUE,   # writes combined_results.csv inside run_dir
    combined_csv_path  = "combined_results.csv"
  )
}

polled <- purrr::map(manifest$run_dir, poll_one_run)
```
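
For readers curious about what a polling helper spares them, here is a generic poll-until-done loop in base R. It is a sketch of the general pattern only: the functions `poll_until_done()` and `make_fake_status()`, the status strings, and the `max_rounds` argument are all hypothetical, and the actual behavior of `llm_resume_multi_batches()` is defined by the package.

```r
# Hypothetical status checker: each job reports "completed" on its second check
make_fake_status <- function() {
  calls <- list()
  function(id) {
    n <- (if (is.null(calls[[id]])) 0L else calls[[id]]) + 1L
    calls[[id]] <<- n
    if (n >= 2L) "completed" else "in_progress"
  }
}

# Generic pattern: check unfinished jobs, sleep, repeat until all are done
poll_until_done <- function(job_ids, check_status, interval_seconds = 0, max_rounds = 10L) {
  done <- stats::setNames(rep(FALSE, length(job_ids)), job_ids)
  for (round in seq_len(max_rounds)) {
    for (id in names(done)[!done]) {
      done[[id]] <- identical(check_status(id), "completed")
    }
    if (all(done)) break
    Sys.sleep(interval_seconds)
  }
  done
}

status <- poll_until_done(c("batch_a", "batch_b"), make_fake_status())
status
```

In a real workflow the interval would be tens of seconds or more, since batch jobs can take a long time to complete.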

## 6.1 Building a single merged results table (all runs)

Each element of `polled` contains a `combined` tibble for that run (i.e., all segments bound together). We can attach run metadata (template/provider/model/thinking/direction) and then bind all runs into one master table.

```{r, eval=FALSE}
combined_all <- purrr::map2_dfr(
  polled,
  seq_len(nrow(manifest)),
  function(res, i) {
    meta <- manifest[i, ]
    if (is.null(res$combined)) return(NULL)

    res$combined |>
      mutate(
        template_id = meta$template_id,
        provider    = meta$provider,
        model       = meta$model,
        thinking    = meta$thinking,
        direction   = meta$direction,
        run_id      = meta$run_id
      )
  }
)

combined_path <- file.path(out_root, "combined_all_runs.csv")
readr::write_csv(combined_all, combined_path)

combined_path
```

# 7. Resuming After Interruption

Resuming works because all job state is stored on disk rather than in your R session:

- Submission writes `jobs_registry.csv` under each run directory
- Polling can be restarted at any time by calling `llm_resume_multi_batches(jobs = NULL, output_dir = <run_dir>)`
- If you keep a `run_manifest.csv` with `run_dir` paths, resuming *all* runs is just a loop

Example: resume only unfinished runs (based on each run’s registry):

```{r, eval=FALSE}
manifest <- readr::read_csv(file.path(out_root, "run_manifest.csv"), show_col_types = FALSE)

needs_poll <- function(run_dir) {
  reg_path <- file.path(run_dir, "jobs_registry.csv")
  if (!file.exists(reg_path)) return(FALSE)
  reg <- readr::read_csv(reg_path, show_col_types = FALSE)
  # Treat a missing (NA) done flag as unfinished so the run is polled again
  !all(as.logical(reg$done) %in% TRUE)
}

unfinished_dirs <- manifest$run_dir[vapply(manifest$run_dir, needs_poll, logical(1))]

polled <- purrr::map(unfinished_dirs, poll_one_run)
```
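
To see the filtering logic in isolation, the sketch below builds three throwaway run directories under `tempdir()` and applies the same kind of check. Only the `done` column and the `jobs_registry.csv` file name come from this vignette; the `batch_id` column and the directory names are made up for the example, and a real registry contains more columns.

```r
root     <- tempfile("runs_")
run_dirs <- file.path(root, c("run_a", "run_b", "run_c"))

# run_a: all jobs done; run_b: one job pending; run_c: never created
for (d in run_dirs[1:2]) dir.create(d, recursive = TRUE)
write.csv(
  data.frame(batch_id = c("b1", "b2"), done = c(TRUE, TRUE)),
  file.path(run_dirs[1], "jobs_registry.csv"), row.names = FALSE
)
write.csv(
  data.frame(batch_id = c("b3", "b4"), done = c(TRUE, FALSE)),
  file.path(run_dirs[2], "jobs_registry.csv"), row.names = FALSE
)

needs_poll <- function(run_dir) {
  reg_path <- file.path(run_dir, "jobs_registry.csv")
  if (!file.exists(reg_path)) return(FALSE)  # nothing was submitted here
  reg <- read.csv(reg_path)
  !all(as.logical(reg$done) %in% TRUE)       # any FALSE or NA means unfinished
}

unfinished <- run_dirs[vapply(run_dirs, needs_poll, logical(1))]
basename(unfinished)  # "run_b"
```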

# 8. Next Steps

Once you have per-run results CSVs (e.g., one per template × model × thinking × direction), you can:

- Compute **reverse consistency** with `compute_reverse_consistency()`
- Analyze **positional bias** with `check_positional_bias()`
- Aggregate results by provider/model/template using standard `dplyr` pipelines
- Fit **Bradley–Terry** models with `build_bt_data()` + `fit_bt_model()`
- Fit **Elo** models with `fit_elo_model()` (when `EloChoice` is installed)

# 9. Citation

> Mercer, S. H. (2025). *Advanced: Submitting and polling multiple batches* [R package vignette].
> In *pairwiseLLM: Pairwise comparison tools for large language model-based writing evaluation*.
> https://doi.org/10.32614/CRAN.package.pairwiseLLM