| Title: | Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation |
|---|---|
| Description: | Provides a unified framework for generating, submitting, and analyzing pairwise comparisons of writing quality using large language models (LLMs). The package supports live and/or batch evaluation workflows across multiple providers ('OpenAI', 'Anthropic', 'Google Gemini', 'Together AI', and locally-hosted 'Ollama' models), includes bias-tested prompt templates and a flexible template registry, and offers tools for constructing forward and reversed comparison sets to analyze consistency and positional bias. The package additionally supports adaptive pairing workflows that iteratively select comparisons based on model uncertainty to improve ranking efficiency. Results can be modeled using frequentist or Bayesian Bradley–Terry–Luce models (Bradley & Terry, 1952 <doi:10.2307/2334029>; see also Caron & Doucet, 2012 <doi:10.1080/10618600.2012.638220>) or Elo rating methods (see Clark et al., 2018 <doi:10.1371/journal.pone.0190393>) to derive writing quality scores. For information on pairwise comparisons and comparative judgement, see Thurstone (1927) <doi:10.1037/h0070288> and Heldsinger & Humphry (2010) <doi:10.1007/BF03216919>. |
| Authors: | Sterett H. Mercer [aut, cre] (ORCID: <https://orcid.org/0000-0002-7940-4221>) |
| Maintainer: | Sterett H. Mercer <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.3.0 |
| Built: | 2026-05-12 06:39:14 UTC |
| Source: | https://github.com/shmercer/pairwisellm |
Retrieve canonical adaptive logs.
adaptive_get_logs(state)
state |
Adaptive state. |
Returns the three canonical Adaptive logs as currently held in memory:
step_log, round_log, and item_log. These correspond to
step attempts, posterior refit rounds, and item-level refit summaries
respectively.
A named list with three elements:
step_log: a tibble with one row per attempted step.
round_log: a tibble with one row per BTL refit round.
item_log: a list of per-refit item tibbles.
adaptive_step_log(), adaptive_round_log(), adaptive_item_log()
Other adaptive logs:
adaptive_item_log(),
adaptive_results_history(),
adaptive_round_log(),
adaptive_step_log()
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
logs <- adaptive_get_logs(state)
names(logs)
Adaptive item log accessor.
adaptive_item_log(state, refit_id = NULL, stack = FALSE)
state |
Adaptive state. |
refit_id |
Optional refit index. |
stack |
When TRUE, stack all refits. |
item_log stores per-item posterior summaries by refit.
The underlying state stores a list of refit tables; this
accessor can return one refit table (default: most recent) or stack all
refits into a single tibble.
A tibble of item-level summaries. When stack = FALSE, one row
per item for the selected refit. When stack = TRUE, one row per item
per refit with refit_id identifying source refit.
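The two return shapes can be illustrated with a small base-R sketch. The refit tables, their columns (item_id, theta_mean), and the helper toy_item_log() are hypothetical stand-ins rather than the package's internals; only the stack and refit_id semantics are taken from the description above.

```r
# Hypothetical per-refit item tables (column names are illustrative only).
refits <- list(
  data.frame(item_id = c("a", "b", "c"), theta_mean = c(0.2, -0.1, -0.1)),
  data.frame(item_id = c("a", "b", "c"), theta_mean = c(0.5, -0.3, -0.2))
)

# Toy accessor mirroring the documented arguments; not the package function.
toy_item_log <- function(refits, refit_id = NULL, stack = FALSE) {
  if (stack) {
    # One row per item per refit, tagged with the source refit index.
    tagged <- Map(function(tbl, id) cbind(refit_id = id, tbl),
                  refits, seq_along(refits))
    return(do.call(rbind, tagged))
  }
  # Default to the most recent refit when no index is given.
  if (is.null(refit_id)) refit_id <- length(refits)
  refits[[refit_id]]
}

latest <- toy_item_log(refits)
stacked <- toy_item_log(refits, stack = TRUE)
```

In this sketch latest holds one row per item from the last refit, while stacked holds every refit with a leading refit_id column.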
adaptive_get_logs(), summarize_items(), adaptive_round_log()
Other adaptive logs:
adaptive_get_logs(),
adaptive_results_history(),
adaptive_round_log(),
adaptive_step_log()
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
adaptive_item_log(state)
High-level workflow wrapper that reads sample data, constructs an LLM judge,
starts or resumes adaptive state, runs adaptive_rank_run_live(), and
returns state plus summary outputs.
adaptive_rank(
  data,
  id_col = 1,
  text_col = 2,
  backend = c("openai", "anthropic", "gemini", "together", "ollama"),
  model = NULL,
  trait = "overall_quality",
  trait_name = NULL,
  trait_description = NULL,
  prompt_template = set_prompt_template(),
  endpoint = "chat.completions",
  api_key = NULL,
  include_raw = FALSE,
  judge_args = list(),
  judge_call_args = list(),
  n_steps = 1L,
  fit_fn = NULL,
  adaptive_config = NULL,
  btl_config = NULL,
  session_dir = NULL,
  persist_item_log = FALSE,
  resume = TRUE,
  seed = 1L,
  progress = c("all", "refits", "steps", "none"),
  progress_redraw_every = 10L,
  progress_show_events = TRUE,
  progress_errors = TRUE,
  save_outputs = FALSE,
  output_file = NULL,
  judge = NULL
)
data |
Data source: a data frame/tibble, a file path ( |
id_col |
ID column selector for tabular inputs. Passed to
|
text_col |
Text column selector for tabular inputs. Passed to
|
backend |
Backend passed to |
model |
Model passed to |
trait |
Built-in trait key used when no custom trait is supplied.
Ignored when both |
trait_name |
Optional custom trait display name. |
trait_description |
Optional custom trait definition. |
prompt_template |
Prompt template string. Defaults to
|
endpoint |
Endpoint family passed to |
api_key |
Optional API key passed to |
include_raw |
Logical; forwarded to |
judge_args |
Named list of fixed additional arguments forwarded to
|
judge_call_args |
Named list of additional arguments forwarded to the
judge at run time through |
n_steps |
Maximum number of attempted adaptive steps to execute in this call. The run may return earlier due to candidate starvation or BTL stop criteria. Attempted invalid steps also count toward this limit. |
fit_fn |
Optional fit override passed to |
adaptive_config |
Optional named list passed to
|
btl_config |
Optional named list passed to |
session_dir |
Optional session directory for persistence/resume. |
persist_item_log |
Logical; write per-refit item logs when |
resume |
Logical; when |
seed |
Integer seed used when creating a new adaptive state. |
progress |
Progress mode for |
progress_redraw_every |
Redraw interval for progress output. |
progress_show_events |
Logical; show step events. |
progress_errors |
Logical; show invalid-step events. |
save_outputs |
Logical; when |
output_file |
Optional output |
judge |
Optional prebuilt judge function with contract
|
This helper is designed for end users who want one entry point for adaptive runs. It supports:
data input from a data frame, file (.csv, .tsv, .txt, .rds),
or a directory of .txt files;
model/backend configuration through make_adaptive_judge_llm();
all adaptive runtime controls exposed by adaptive_rank_run_live();
resumability via session_dir and resume;
optional saving of run outputs to an .rds artifact.
Model options:
use judge_args (fixed) and judge_call_args (per-run overrides) to pass
any additional llm_compare_pair() arguments, including provider-specific
controls such as reasoning, service_tier, temperature, top_p,
logprobs, include_thoughts, or host.
Adaptive options:
all key controls from adaptive_rank_run_live() are available directly:
n_steps, fit_fn, adaptive_config, btl_config, progress,
progress_redraw_every, progress_show_events, progress_errors,
session_dir, and persist_item_log.
Use adaptive_config for identifiability-gated controller behavior and
btl_config for inference/diagnostics cadence only.
Selection semantics: pair selection is TrueSkill-driven in one-pair transactional steps. Rolling anchors are refreshed from current score proxies and anchor-link routing compares exactly one anchor endpoint with one non-anchor endpoint. Long/mid-link routing excludes anchor-anchor and anchor-non-anchor pairs, while local-link routing admits same-stratum pairs and anchor-involving pairs according to stage bounds.
Wrapper-visible defaults include top-band refinement
(top_band_pct = 0.10, top_band_bins = 5) with top-band size computed as
ceiling(top_band_pct * N).
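As a quick arithmetic check of the top-band rule, the following base-R sketch evaluates ceiling(top_band_pct * N) for a few item counts. The helper name top_band_size() is hypothetical and is not a package function.

```r
# Top-band size as stated above: ceiling(top_band_pct * N).
# top_band_size() is an illustrative helper, not part of the package.
top_band_size <- function(n_items, top_band_pct = 0.10) {
  stopifnot(n_items >= 1, top_band_pct > 0, top_band_pct <= 1)
  as.integer(ceiling(top_band_pct * n_items))
}

# Item counts of 8, 12, and 25 with the default 10% band.
sizes <- vapply(c(8, 12, 25), top_band_size, integer(1))
```

Because of the ceiling, even a small item set always has at least one item in the top band.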
Exposure and repeat routing:
under-represented routing is degree-based (deg <= D_min + 1), while
repeat-pressure gating is based on recent exposure (bottom-quantile
recent_deg with quantile default 0.25) and per-endpoint repeat slot
accounting.
Inference separation: BTL refits are used for posterior inference, diagnostics, and stopping only. They are not used to choose the next pair.
Resume behavior:
when resume = TRUE and session_dir already contains adaptive artifacts,
failed session loads abort with an actionable error instead of starting a
fresh run silently.
A list with:
Final adaptive_state.
Run-level summary from summarize_adaptive().
Per-refit summary from summarize_refits().
Item summary from summarize_items().
Canonical logs from adaptive_get_logs().
Saved output path when save_outputs = TRUE, otherwise
NULL.
make_adaptive_judge_llm(), adaptive_rank_run_live(),
adaptive_rank_start(), adaptive_rank_resume(), llm_compare_pair()
Other adaptive ranking:
adaptive_rank_resume(),
adaptive_rank_run_live(),
adaptive_rank_start(),
make_adaptive_judge_llm(),
summarize_adaptive()
data("example_writing_samples", package = "pairwiseLLM")

out <- adaptive_rank(
  data = example_writing_samples[1:8, c("ID", "text", "quality_score")],
  id_col = "ID",
  text_col = "text",
  model = "gpt-5.1",
  judge = function(A, B, state, ...) {
    y <- as.integer(A$quality_score[[1]] >= B$quality_score[[1]])
    list(is_valid = TRUE, Y = y, invalid_reason = NA_character_)
  },
  n_steps = 4,
  progress = "none"
)
out$summary
head(out$logs$step_log)

## Not run:
# Live run with OpenAI gpt-5.1 + flex priority.
live <- adaptive_rank(
  data = example_writing_samples[1:12, c("ID", "text")],
  backend = "openai",
  model = "gpt-5.1",
  endpoint = "responses",
  judge_args = list(
    reasoning = "low",
    service_tier = "flex",
    include_thoughts = FALSE
  ),
  btl_config = list(
    refit_pairs_target = 20L,
    ess_bulk_min = 500,
    eap_reliability_min = 0.92
  ),
  adaptive_config = list(
    explore_taper_mult = 0.40,
    star_override_budget_per_round = 2L
  ),
  n_steps = 120,
  session_dir = file.path(tempdir(), "adaptive-live"),
  persist_item_log = TRUE,
  resume = TRUE,
  progress = "all",
  save_outputs = TRUE
)
print(live$state)
live$summary

## End(Not run)
Resume a previously persisted adaptive pairing session.
adaptive_rank_resume(session_dir, ...)
session_dir |
Directory containing session artifacts. |
... |
Reserved for future extensions; currently unused. |
This is a thin wrapper around load_adaptive_session() and performs schema
and log-shape checks during load. Returned state preserves canonical
step_log, round_log, and item_log contents used for
adaptive auditability.
An adaptive_state object restored from disk.
adaptive_rank_start(), adaptive_rank_run_live(),
save_adaptive_session(), load_adaptive_session()
Other adaptive ranking:
adaptive_rank(),
adaptive_rank_run_live(),
adaptive_rank_start(),
make_adaptive_judge_llm(),
summarize_adaptive()
dir <- tempfile("pwllm-session-")
state <- adaptive_rank_start(c("a", "b", "c"), seed = 3)
save_adaptive_session(state, dir, overwrite = TRUE)
restored <- adaptive_rank_resume(dir)
summarize_adaptive(restored)
Execute stepwise adaptive ranking with a user-supplied judge.
adaptive_rank_run_live(
  state,
  judge,
  n_steps = 1L,
  fit_fn = NULL,
  adaptive_config = NULL,
  btl_config = NULL,
  session_dir = NULL,
  persist_item_log = NULL,
  progress = c("all", "refits", "steps", "none"),
  progress_redraw_every = 10L,
  progress_show_events = TRUE,
  progress_errors = TRUE,
  ...
)
state |
An adaptive state object created by |
judge |
A function called as |
n_steps |
Maximum number of attempted adaptive steps to execute in this call. The run may terminate earlier if candidate starvation is encountered or if BTL stopping criteria are met at a refit. Each attempted step counts toward this budget, including invalid judge responses. |
fit_fn |
Optional BTL fit function for deterministic testing; defaults
to |
adaptive_config |
Optional named list overriding adaptive controller
behavior. Supported fields:
|
btl_config |
Optional named list overriding BTL refit cadence, stopping thresholds, and selected round-log diagnostics. Supported fields:
Defaults are resolved from the current item count, then merged with user overrides. |
session_dir |
Optional directory for saving session artifacts. |
persist_item_log |
Logical; when TRUE, write per-refit item logs to disk. |
progress |
Progress output: "all", "refits", "steps", or "none". |
progress_redraw_every |
Redraw progress bar every N steps. |
progress_show_events |
Logical; when TRUE, print notable step events. |
progress_errors |
Logical; when TRUE, include invalid-step events. |
... |
Additional arguments passed through to |
Each iteration attempts at most one pair evaluation ("one-pair step"), then
applies transactional updates if and only if the judge response is valid.
Invalid responses produce a logged step with
pair_id = NA and must not update committed-comparison state.
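The transactional rule can be sketched with a toy controller in base R. The state layout and the helpers new_toy_state() and toy_step() are hypothetical and far simpler than the package's adaptive_state; only the judge return shape (is_valid, Y, invalid_reason) and the pair_id = NA convention come from this documentation.

```r
# Toy stand-in for the controller state; not the package's internal layout.
new_toy_state <- function() {
  list(
    next_pair_id = 1L,
    n_committed = 0L,
    step_log = data.frame(step_id = integer(), pair_id = integer(), Y = integer())
  )
}

# Apply one attempted step given a judge response.
toy_step <- function(state, response) {
  step_id <- nrow(state$step_log) + 1L
  valid <- isTRUE(response$is_valid) && response$Y %in% c(0L, 1L)

  # Invalid responses are still logged, but with pair_id = NA.
  pair_id <- if (valid) state$next_pair_id else NA_integer_
  y <- if (valid) as.integer(response$Y) else NA_integer_
  state$step_log <- rbind(
    state$step_log,
    data.frame(step_id = step_id, pair_id = pair_id, Y = y)
  )

  # Committed-comparison state changes only for valid responses.
  if (valid) {
    state$n_committed <- state$n_committed + 1L
    state$next_pair_id <- state$next_pair_id + 1L
  }
  state
}

s <- new_toy_state()
s <- toy_step(s, list(is_valid = TRUE, Y = 1L, invalid_reason = NA_character_))
s <- toy_step(s, list(is_valid = FALSE, Y = NA_integer_,
                      invalid_reason = "model_response_invalid"))
```

After these two attempts the toy log has two rows, but only the first carries a pair_id and counts as committed.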
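Pair selection is TrueSkill-based and does not use BTL posterior draws. Utility is based on
the current TrueSkill win probability for each candidate pair (logged in step_log as p_ij,
alongside the base utility U0_ij), with exploration/exploitation routing and
fallback handling recorded in step_log.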
Round scheduling uses stage-specific admissibility:
rolling-anchor links compare one anchor and one non-anchor endpoint;
long/mid links exclude anchor endpoints and enforce stratum-distance bounds;
local-link routing admits same-stratum pairs and anchor-involving pairs within local stage bounds.
Exposure and repeat handling are soft, stage-local constraints:
under-represented exploration uses degree set deg <= D_min + 1, while
repeat-pressure gating uses bottom-quantile recent_deg (default quantile
0.25) and per-endpoint repeat-slot accounting against
repeat_in_round_budget.
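One way to read the two exposure sets is sketched below in base R. The degree vectors are made up, and two interpretations are assumptions rather than documented behavior: that D_min is the smallest current degree, and that "bottom-quantile" means at or below the 0.25 quantile of recent_deg.

```r
# Made-up cumulative and recent comparison counts per item.
deg <- c(a = 3, b = 1, c = 2, d = 5, e = 1)
recent_deg <- c(a = 2, b = 0, c = 1, d = 3, e = 0)

# Assumption: D_min is the smallest degree currently observed.
d_min <- min(deg)
under_represented <- names(deg)[deg <= d_min + 1]

# Assumption: "bottom-quantile" means at or below the 0.25 quantile.
cutoff <- quantile(recent_deg, probs = 0.25, names = FALSE)
low_recent <- names(recent_deg)[recent_deg <= cutoff]
```

Here under_represented depends only on cumulative degree, whereas low_recent reacts to how often an item was used recently, so the two sets can differ.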
Top-band defaults for stratum construction are
top_band_pct = 0.10 and top_band_bins = 5, with top-band size
ceiling(top_band_pct * N).
Bayesian BTL refits are triggered on step-based cadence and evaluated with
diagnostics gates (including ESS thresholds), reliability, and lagged
stability criteria. Refit-level outcomes are
appended to round_log; per-item posterior summaries are appended to
item_log. Controller behavior can change after refits via
identifiability-gated settings in adaptive_config; those controls
affect pair routing and quotas, while BTL remains inference-only.
An updated adaptive_state. The returned state includes
appended step_log rows for attempted steps and, when refits occur,
appended round_log and item_log entries.
adaptive_rank_start(), adaptive_rank_resume(),
adaptive_step_log(), adaptive_round_log(), adaptive_item_log()
Other adaptive ranking:
adaptive_rank(),
adaptive_rank_resume(),
adaptive_rank_start(),
make_adaptive_judge_llm(),
summarize_adaptive()
# ------------------------------------------------------------------
# Offline end-to-end workflow (fast, deterministic, CRAN-safe)
# ------------------------------------------------------------------

data("example_writing_samples", package = "pairwiseLLM")

items <- dplyr::rename(
  example_writing_samples[1:8, c("ID", "text", "quality_score")],
  item_id = ID
)

# Use the package defaults for trait and prompt template.
trait <- trait_description("overall_quality")
prompt_template <- set_prompt_template()

# Deterministic local judge based on fixture quality scores.
sim_judge <- function(A, B, state, ...) {
  y <- as.integer(A$quality_score[[1]] >= B$quality_score[[1]])
  list(is_valid = TRUE, Y = y, invalid_reason = NA_character_)
}

session_dir <- tempfile("pwllm-adaptive-session-")

state <- adaptive_rank_start(
  items = items,
  seed = 42,
  adaptive_config = list(
    global_identified_reliability_min = 0.85,
    star_override_budget_per_round = 2L
  ),
  session_dir = session_dir,
  persist_item_log = TRUE
)

state <- adaptive_rank_run_live(
  state = state,
  judge = sim_judge,
  n_steps = 6,
  btl_config = list(
    # Keep examples lightweight while showing custom stop config inputs.
    refit_pairs_target = 50L,
    ess_bulk_min = 400,
    eap_reliability_min = 0.90
  ),
  adaptive_config = list(
    explore_taper_mult = 0.40,
    boundary_frac = 0.20
  ),
  progress = "steps",
  progress_redraw_every = 1L,
  progress_show_events = TRUE,
  progress_errors = TRUE
)

# Print and inspect run outputs.
print(state)
run_summary <- summarize_adaptive(state)
step_view <- adaptive_step_log(state)
logs <- adaptive_get_logs(state)

run_summary
head(step_view)
names(logs)

# Resume from disk and continue.
resumed <- adaptive_rank_resume(session_dir)
resumed <- adaptive_rank_run_live(
  state = resumed,
  judge = sim_judge,
  n_steps = 4,
  progress = "none"
)
summarize_adaptive(resumed)

# ------------------------------------------------------------------
# Live OpenAI workflow via backend-agnostic llm_compare_pair()
# ------------------------------------------------------------------

## Not run:
# Requires network + OPENAI_API_KEY. This incurs API cost.
# check_llm_api_keys() is a quick preflight.
check_llm_api_keys()

data("example_writing_samples", package = "pairwiseLLM")

live_items <- dplyr::rename(
  example_writing_samples[1:12, c("ID", "text")],
  item_id = ID
)

# Default trait/template setup used by the backend-agnostic runner.
trait <- trait_description("overall_quality")
prompt_template <- set_prompt_template()

live_session_dir <- file.path(tempdir(), "pwllm-adaptive-openai")

judge_openai <- function(A, B, state, ...) {
  res <- llm_compare_pair(
    ID1 = A$item_id[[1]],
    text1 = A$text[[1]],
    ID2 = B$item_id[[1]],
    text2 = B$text[[1]],
    model = "gpt-5.1",
    trait_name = trait$name,
    trait_description = trait$description,
    prompt_template = prompt_template,
    backend = "openai",
    endpoint = "responses",
    reasoning = "low",
    service_tier = "flex",
    include_thoughts = FALSE,
    temperature = NULL,
    top_p = NULL,
    logprobs = NULL
  )

  better_id <- res$better_id[[1]]
  ok_ids <- c(A$item_id[[1]], B$item_id[[1]])
  if (is.na(better_id) || !(better_id %in% ok_ids)) {
    return(list(
      is_valid = FALSE,
      Y = NA_integer_,
      invalid_reason = "model_response_invalid"
    ))
  }

  list(
    is_valid = TRUE,
    Y = as.integer(identical(better_id, A$item_id[[1]])),
    invalid_reason = NA_character_
  )
}

state_live <- adaptive_rank_start(
  items = live_items,
  seed = 2026,
  session_dir = live_session_dir,
  persist_item_log = TRUE
)

state_live <- adaptive_rank_run_live(
  state = state_live,
  judge = judge_openai,
  n_steps = 120L,
  btl_config = list(
    refit_pairs_target = 20L,
    ess_bulk_min = 500,
    ess_bulk_min_near_stop = 1200,
    max_rhat = 1.01,
    divergences_max = 0L,
    eap_reliability_min = 0.92,
    stability_lag = 2L,
    theta_corr_min = 0.97,
    theta_sd_rel_change_max = 0.08,
    rank_spearman_min = 0.97
  ),
  progress = "all",
  progress_redraw_every = 1L,
  progress_show_events = TRUE,
  progress_errors = TRUE
)

# Reporting outputs for end users.
print(state_live)
run_summary <- summarize_adaptive(state_live)
refit_summary <- summarize_refits(state_live)
item_summary <- summarize_items(state_live)
logs <- adaptive_get_logs(state_live)

# Store outputs for audit/reproducibility.
saveRDS(
  list(
    run_summary = run_summary,
    refit_summary = refit_summary,
    item_summary = item_summary,
    logs = logs
  ),
  file.path(live_session_dir, "adaptive_outputs.rds")
)

# Resume from stored state and continue sampling.
state_live <- adaptive_rank_resume(live_session_dir)
state_live <- adaptive_rank_run_live(
  state = state_live,
  judge = judge_openai,
  n_steps = 40L,
  progress = "refits"
)
print(summarize_adaptive(state_live))

## End(Not run)
Initialize an adaptive ranking session and canonical state object.
adaptive_rank_start(
  items,
  seed = 1L,
  session_dir = NULL,
  persist_item_log = FALSE,
  ...,
  adaptive_config = NULL
)
items |
A vector or data frame of items. Data frames must include an
|
seed |
Integer seed used for deterministic warm-start shuffling and selection randomness. |
session_dir |
Optional directory for saving session artifacts. |
persist_item_log |
Logical; when TRUE, write per-refit item logs to disk. |
... |
Internal/testing only. Supply |
adaptive_config |
Optional named list overriding adaptive controller behavior. Supported fields:
Unknown fields and invalid values abort with an actionable error. |
This function creates the stepwise controller state and seeds all canonical
logs used in the adaptive pairing workflow. Warm start pair construction
follows the shuffled chain design, which guarantees a connected comparison
graph after committed comparisons.
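The connectivity claim is easy to see in a minimal base-R sketch. This is one plausible reading of a "shuffled chain" (shuffle the items, then pair each item with its successor); the helper shuffled_chain() is hypothetical and the package's actual warm-start construction may differ in detail.

```r
# Illustrative warm-start pairing: shuffle, then link consecutive items.
# shuffled_chain() is a hypothetical helper, not a package function.
shuffled_chain <- function(ids, seed = 1L) {
  set.seed(seed)
  shuffled <- sample(ids)
  data.frame(
    A = shuffled[-length(shuffled)],  # items 1 .. n-1 of the shuffled order
    B = shuffled[-1],                 # items 2 .. n of the shuffled order
    stringsAsFactors = FALSE
  )
}

ids <- c("a", "b", "c", "d", "e")
warm <- shuffled_chain(ids, seed = 42)
```

With n items this yields n - 1 pairs that form a single path through every item, so the comparison graph is connected once all of them are committed.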
Pair selection in this framework is TrueSkill-driven and uses a base utility
(logged as U0_ij in step_log) computed from p_ij, where p_ij is the current TrueSkill
win probability for pair (i, j). Bayesian
BTL posterior draws are not used for pair selection; they are used for
posterior inference, diagnostics, and stopping at refit rounds.
The returned state contains canonical logs:
step_log: one row per attempted step,
round_log: one row per posterior refit,
item_log: per-item posterior summaries by refit.
If session_dir is supplied, the initialized state is persisted
immediately using save_adaptive_session().
An adaptive state object containing step_log, round_log, and
item_log. The object includes class "adaptive_state", item ID
mappings, TrueSkill state, warm-start queue, refit metadata, and runtime
configuration.
adaptive_rank_run_live(), adaptive_rank_resume(),
adaptive_step_log(), adaptive_round_log(), adaptive_item_log()
Other adaptive ranking:
adaptive_rank(),
adaptive_rank_resume(),
adaptive_rank_run_live(),
make_adaptive_judge_llm(),
summarize_adaptive()
state <- adaptive_rank_start(c("a", "b", "c"), seed = 11)
summarize_adaptive(state)
Adaptive results history in build_bt_data() format.
adaptive_results_history(state, committed_only = TRUE)
state |
Adaptive state. |
committed_only |
Use only committed comparisons. |
Converts adaptive step outcomes into the three-column format used by
build_bt_data() (object1, object2, result). With
committed_only = TRUE, only committed steps (pair_id not
missing) are retained. This preserves the transactional invariant that
invalid steps do not contribute to inferred comparisons.
A tibble with columns:
object1: Character item id shown in position A.
object2: Character item id shown in position B.
result: Numeric outcome in {0, 1} where 1 means
object1 wins.
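The conversion can be sketched on a toy step log in base R. The log below is fabricated for illustration and carries only four of the step_log columns; mapping A to object1, B to object2, and Y to result is an assumption based on the column descriptions above, and toy_results_history() is a hypothetical helper.

```r
# Toy step log: the second step was invalid, so pair_id and Y are NA.
step_log <- data.frame(
  pair_id = c(1L, NA, 2L),
  A = c("a", "b", "c"),
  B = c("b", "c", "a"),
  Y = c(1L, NA, 0L),
  stringsAsFactors = FALSE
)

# Illustrative conversion to the three-column build_bt_data() layout.
toy_results_history <- function(step_log, committed_only = TRUE) {
  if (committed_only) {
    # Keep only committed steps, i.e. those with a non-missing pair_id.
    step_log <- step_log[!is.na(step_log$pair_id), ]
  }
  data.frame(
    object1 = step_log$A,
    object2 = step_log$B,
    result = as.numeric(step_log$Y),
    stringsAsFactors = FALSE
  )
}

history <- toy_results_history(step_log)
```

The invalid step is dropped, so history contains exactly the two committed comparisons.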
build_bt_data(), adaptive_step_log()
Other adaptive logs:
adaptive_get_logs(),
adaptive_item_log(),
adaptive_round_log(),
adaptive_step_log()
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
adaptive_results_history(state)
Adaptive round log accessor.
adaptive_round_log(state)
state |
Adaptive state. |
round_log is the canonical per-refit audit log for the adaptive
pairing workflow.
Each row summarizes one Bayesian BTL refit and includes
diagnostics, reliability, and stopping-gate fields used to justify stop
decisions.
Core columns:
Refit identity/state: refit_id, round_id_at_refit,
step_id_at_refit, timestamp, model_variant,
n_items, total_pairs_done, new_pairs_since_last_refit,
n_unique_pairs_seen.
Candidate health: proposed_pairs_mode,
starve_rate_since_last_refit, fallback_rate_since_last_refit,
fallback_used_mode, starvation_reason_mode.
Identifiability/quota adaptation: global_identified,
global_identified_reliability_min,
global_identified_rank_corr_min, long_quota_raw,
long_quota_effective, long_quota_removed,
realloc_to_mid, realloc_to_local.
Coverage/imbalance: mean_degree, min_degree,
pos_balance_sd, star_cap_rejects_since_last_refit,
star_cap_reject_rate_since_last_refit,
recent_deg_median_since_last_refit,
recent_deg_max_since_last_refit.
Posterior parameter summaries:
epsilon_mean/percentiles and b_mean/percentiles.
Audit diagnostics: ts_sigma_mean, ts_sigma_max,
ts_degree_sigma_corr, ts_btl_theta_corr,
ts_btl_rank_spearman, ci95_theta_width_*,
near_tie_adj_frac, near_tie_adj_count, p_adj_median,
cov_trace_theta, cov_logdet_diag_theta,
post_sd_theta_p10, post_sd_theta_p50,
post_sd_theta_p90, top20_boundary_entropy_*,
nn_diff_sd_*.
Stopping diagnostics: diagnostics_pass,
diagnostics_divergences_pass, diagnostics_rhat_pass,
diagnostics_ess_pass, divergences,
divergences_max_allowed, max_rhat,
max_rhat_allowed, min_ess_bulk,
ess_bulk_required, near_stop_active,
reliability_EAP, eap_reliability_min, eap_pass,
theta_sd_eap, rho_theta, lag_eligible,
theta_corr_min, theta_corr_pass, delta_sd_theta,
theta_sd_rel_change_max, delta_sd_theta_pass,
rho_rank, rank_spearman_min, rho_rank_pass.
Refit execution metadata: mcmc_chains,
mcmc_parallel_chains, mcmc_core_fraction,
mcmc_cores_detected_physical, mcmc_cores_detected_logical,
mcmc_threads_per_chain, mcmc_cmdstanr_version.
Stop output: stop_decision, stop_reason.
A tibble with one row per completed posterior refit round.
adaptive_get_logs(), summarize_refits(), adaptive_rank_run_live()
Other adaptive logs:
adaptive_get_logs(),
adaptive_item_log(),
adaptive_results_history(),
adaptive_step_log()
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
adaptive_round_log(state)
Adaptive step log accessor.
adaptive_step_log(state)
state |
Adaptive state. |
step_log is the canonical per-step audit log for the adaptive
workflow. It records candidate pipeline outcomes, selected pair/order, and
commit status. A step with invalid judge response keeps committed fields
as NA and must not update model state.
Core columns:
Identity/outcome: step_id, timestamp, pair_id,
i, j, A, B, Y, status.
Routing/scheduling: round_id, round_stage,
pair_type, stage_committed_so_far, stage_quota.
Exposure/strata: used_in_round_i, used_in_round_j,
is_anchor_i, is_anchor_j, stratum_i,
stratum_j, dist_stratum.
Candidate health: is_explore_step, explore_mode,
explore_reason, explore_rate_used,
local_priority_mode, long_gate_pass,
long_gate_reason, star_override_used,
star_override_reason, candidate_starved,
fallback_used, fallback_path, starvation_reason.
Candidate counts: n_candidates_generated,
n_candidates_after_hard_filters, n_candidates_after_duplicates,
n_candidates_after_star_caps, n_candidates_scored.
Endpoint diagnostics: deg_i, deg_j,
recent_deg_i, recent_deg_j, mu_i, mu_j,
sigma_i, sigma_j, p_ij, U0_ij.
Star-cap diagnostics: star_cap_rejects,
star_cap_reject_items.
A tibble with one row per attempted step, in execution order.
adaptive_get_logs(), adaptive_round_log(), adaptive_rank_run_live()
Other adaptive logs:
adaptive_get_logs(),
adaptive_item_log(),
adaptive_results_history(),
adaptive_round_log()
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
adaptive_step_log(state)
This helper takes a table of paired writing samples (with columns
ID1, text1, ID2, and text2) and reverses the sample order for
every second row (rows 2, 4, 6, ...). This provides a perfectly balanced
reversal pattern without the randomness of randomize_pair_order().
alternate_pair_order(pairs)
pairs |
A tibble or data frame with columns |
This is useful when you want a fixed 50/50 mix of original and reversed pairs for bias control, benchmarking, or debugging, without relying on the random number generator or seeds.
A tibble identical to pairs except that rows 2, 4, 6, ...
have ID1/text1 and ID2/text2 swapped.
data("example_writing_samples")
pairs <- make_pairs(example_writing_samples)
pairs_alt <- alternate_pair_order(pairs)
head(pairs[, c("ID1", "ID2")])
head(pairs_alt[, c("ID1", "ID2")])
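The alternating swap can be illustrated with a small base-R sketch. This is not the package's implementation; `swap_even_rows()` and `pairs_demo` are hypothetical names used only to show which rows change.

```r
# Sketch only: swap ID1/text1 with ID2/text2 on rows 2, 4, 6, ...
swap_even_rows <- function(pairs) {
  stopifnot(all(c("ID1", "text1", "ID2", "text2") %in% names(pairs)))
  even <- seq_len(nrow(pairs)) %% 2 == 0
  out <- pairs
  out$ID1[even]   <- pairs$ID2[even]
  out$text1[even] <- pairs$text2[even]
  out$ID2[even]   <- pairs$ID1[even]
  out$text2[even] <- pairs$text1[even]
  out
}

# Hypothetical toy data with the four required columns
pairs_demo <- data.frame(
  ID1   = c("S1", "S1", "S2", "S2"),
  text1 = c("one", "one", "two", "two"),
  ID2   = c("S2", "S3", "S3", "S4"),
  text2 = c("two", "three", "three", "four"),
  stringsAsFactors = FALSE
)

alt_demo <- swap_even_rows(pairs_demo)
alt_demo[, c("ID1", "ID2")]
```

Rows 1 and 3 keep their original order, while rows 2 and 4 are reversed, so the split is fixed regardless of any seed.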
This function sends a single pairwise comparison prompt to the Anthropic Messages API (Claude models) and parses the result into a small tibble.
anthropic_compare_pair_live(
  ID1,
  text1,
  ID2,
  text2,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>",
  api_key = NULL,
  anthropic_version = "2023-06-01",
  reasoning = c("none", "enabled"),
  include_raw = FALSE,
  include_thoughts = NULL,
  ...
)
ID1 |
Character ID for the first sample. |
text1 |
Character string containing the first sample's text. |
ID2 |
Character ID for the second sample. |
text2 |
Character string containing the second sample's text. |
model |
Anthropic Claude model name (for example
|
trait_name |
Short label for the trait (for example "Overall Quality"). |
trait_description |
Full-text definition of the trait. |
prompt_template |
Prompt template string, typically from
|
tag_prefix |
Prefix for the better-sample tag. Defaults to
|
tag_suffix |
Suffix for the better-sample tag. Defaults to
|
api_key |
Optional Anthropic API key. Defaults to
|
anthropic_version |
Anthropic API version string passed as the
|
reasoning |
Character scalar indicating whether to allow more extensive internal "thinking" before the visible answer. Two values are recognised:
|
include_raw |
Logical; if |
include_thoughts |
Logical or |
... |
Additional Anthropic parameters such as When
When
|
It mirrors the behaviour and output schema of
openai_compare_pair_live, but targets Anthropic's
/v1/messages endpoint. The prompt template, <BETTER_SAMPLE> tag
convention, and downstream parsing / BT modelling can remain unchanged.
The function is designed to work with Claude models such as Sonnet, Haiku, and Opus in the "4.5" family. You can pass any valid Anthropic model string, for example:
"claude-sonnet-4-5"
"claude-haiku-4-5"
"claude-opus-4-5"
The API typically responds with a dated model string such as
"claude-sonnet-4-5-20250929" in the model field.
Recommended defaults for pairwise writing comparisons
For stable, reproducible comparisons we recommend:
reasoning = "none" with temperature = 0 and
max_tokens = 768 for standard pairwise scoring.
reasoning = "enabled" when you explicitly want extended
thinking; in this mode Anthropic requires temperature = 1.
The defaults in this mode are max_tokens = 2048 and
thinking_budget_tokens = 1024, which satisfy the documented
constraints thinking_budget_tokens >= 1024 and
thinking_budget_tokens < max_tokens.
When reasoning = "enabled", this function also sends a
thinking block to the Anthropic API:
"thinking": {
"type": "enabled",
"budget_tokens": <thinking_budget_tokens>
}
Setting include_thoughts = TRUE when reasoning = "none"
is a convenient way to opt into Anthropic's extended thinking mode without
changing the reasoning argument explicitly. In that case,
reasoning is upgraded to "enabled", the default
temperature becomes 1, and a thinking block is included in the
request. When reasoning = "none" and include_thoughts is
FALSE or NULL, the default temperature remains 0 unless
you explicitly override it.
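The interaction between reasoning, include_thoughts, and the sampling defaults can be summarised in a short base-R sketch. The helper `resolve_anthropic_settings()` is hypothetical and is not part of pairwiseLLM. It only encodes the rules stated above, and it assumes that an invalid temperature or thinking budget should raise an error.

```r
# Sketch only: resolve effective settings from the documented rules.
resolve_anthropic_settings <- function(reasoning = c("none", "enabled"),
                                       include_thoughts = NULL,
                                       temperature = NULL,
                                       max_tokens = NULL,
                                       thinking_budget_tokens = NULL) {
  reasoning <- match.arg(reasoning)

  # include_thoughts = TRUE upgrades reasoning = "none" to "enabled"
  if (isTRUE(include_thoughts) && reasoning == "none") {
    reasoning <- "enabled"
  }

  if (reasoning == "none") {
    if (is.null(temperature)) temperature <- 0
    if (is.null(max_tokens)) max_tokens <- 768
    return(list(reasoning = reasoning, temperature = temperature,
                max_tokens = max_tokens, thinking = NULL))
  }

  # Extended thinking: temperature must be 1 (erroring here is an assumption)
  if (is.null(temperature)) temperature <- 1
  if (temperature != 1) {
    stop("temperature must be 1 when reasoning = \"enabled\".")
  }
  if (is.null(max_tokens)) max_tokens <- 2048
  if (is.null(thinking_budget_tokens)) thinking_budget_tokens <- 1024
  if (thinking_budget_tokens < 1024 || thinking_budget_tokens >= max_tokens) {
    stop("Need 1024 <= thinking_budget_tokens < max_tokens.")
  }

  list(
    reasoning = reasoning,
    temperature = temperature,
    max_tokens = max_tokens,
    thinking = list(type = "enabled", budget_tokens = thinking_budget_tokens)
  )
}

s_plain <- resolve_anthropic_settings(reasoning = "none")
s_upgraded <- resolve_anthropic_settings(reasoning = "none",
                                         include_thoughts = TRUE)
str(s_upgraded)
```

In this sketch `s_plain` has temperature 0 and no thinking block, while `s_upgraded` has temperature 1 and a thinking block with the default budget.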
A tibble with one row and columns:
Stable ID for the pair (pair_uid if supplied via
...; otherwise "LIVE_<ID1>_vs_<ID2>").
The sample IDs you supplied.
Model name reported by the API.
Anthropic object type (for example "message").
HTTP-style status code (200 if successful).
Error message if something goes wrong; otherwise NA.
Summarised thinking / reasoning text when
reasoning = "enabled" and the API returns thinking blocks;
otherwise NA.
Concatenated text from the assistant output (excluding thinking blocks).
"SAMPLE_1", "SAMPLE_2", or NA.
ID1 if SAMPLE_1 is chosen, ID2 if SAMPLE_2 is chosen, otherwise NA.
Prompt / input token count (if reported).
Completion / output token count (if reported).
Total token count (reported by the API or computed as input + output tokens when not provided).
(Optional) list-column containing the parsed JSON body.
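How the winner fields relate to the tagged answer can be shown with a base-R sketch. The helper `extract_better_sample()` and the field name `better_sample` are assumptions made for illustration (only `better_id` appears by name in the examples), and the package's real parser may differ.

```r
# Sketch only: pull the label between the tags and map it to a sample ID.
extract_better_sample <- function(content, ID1, ID2,
                                  tag_prefix = "<BETTER_SAMPLE>",
                                  tag_suffix = "</BETTER_SAMPLE>") {
  na_result <- list(better_sample = NA_character_, better_id = NA_character_)
  if (is.na(content)) return(na_result)

  # Locate the opening tag as a literal string (no regex interpretation)
  start <- as.integer(regexpr(tag_prefix, content, fixed = TRUE))
  if (start < 0) return(na_result)

  rest <- substring(content, start + nchar(tag_prefix))
  end <- as.integer(regexpr(tag_suffix, rest, fixed = TRUE))
  if (end < 0) return(na_result)

  label <- trimws(substring(rest, 1, end - 1))
  if (!label %in% c("SAMPLE_1", "SAMPLE_2")) return(na_result)

  list(
    better_sample = label,
    better_id = if (label == "SAMPLE_1") ID1 else ID2
  )
}

parsed <- extract_better_sample(
  "Sample 2 is clearer. <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>",
  ID1 = "S01", ID2 = "S02"
)
parsed$better_id
```

A response with no tag, or with a label other than SAMPLE_1 or SAMPLE_2, yields NA for both fields in this sketch.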
## Not run:
# Requires ANTHROPIC_API_KEY and network access.
library(pairwiseLLM)

data("example_writing_samples", package = "pairwiseLLM")
samples <- example_writing_samples[1:2, ]
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Short, deterministic comparison with no explicit thinking block
res_claude <- anthropic_compare_pair_live(
  ID1 = samples$ID[1],
  text1 = samples$text[1],
  ID2 = samples$ID[2],
  text2 = samples$text[2],
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  reasoning = "none",
  include_raw = FALSE
)
res_claude$better_id

# Allow more internal thinking and a longer explanation
res_claude_reason <- anthropic_compare_pair_live(
  ID1 = samples$ID[1],
  text1 = samples$text[1],
  ID2 = samples$ID[2],
  text2 = samples$text[2],
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  reasoning = "enabled",
  include_raw = TRUE,
  include_thoughts = TRUE
)
res_claude_reason$total_tokens
substr(res_claude_reason$content, 1, 200)
## End(Not run)
This is a thin wrapper around Anthropic's
/v1/messages/batches endpoint. It accepts a list of request
objects (each with custom_id and params) and returns the
resulting Message Batch object.
anthropic_create_batch(
  requests,
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  anthropic_version = "2023-06-01"
)
requests |
List of request objects, each of the form
|
api_key |
Optional Anthropic API key. Defaults to
|
anthropic_version |
Anthropic API version string passed as the
|
Typically you will not call this directly; instead, use
run_anthropic_batch_pipeline which builds requests from a
tibble of pairs, creates the batch, polls for completion, and downloads
the results.
A list representing the Message Batch object returned by Anthropic.
Important fields include id, processing_status,
request_counts, and (after completion) results_url.
## Not run:
# Requires ANTHROPIC_API_KEY and network access.
library(pairwiseLLM)

data("example_writing_samples", package = "pairwiseLLM")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 2, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

req_tbl <- build_anthropic_batch_requests(
  pairs = pairs,
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

requests <- lapply(seq_len(nrow(req_tbl)), function(i) {
  list(
    custom_id = req_tbl$custom_id[i],
    params = req_tbl$params[[i]]
  )
})

batch <- anthropic_create_batch(requests = requests)
batch$id
batch$processing_status
## End(Not run)
Once a Message Batch has finished processing (status "ended"),
Anthropic exposes a results_url field pointing to a .jsonl
file containing one JSON object per request result.
anthropic_download_batch_results(
  batch_id,
  output_path,
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  anthropic_version = "2023-06-01"
)
batch_id |
Character scalar giving the batch ID. |
output_path |
File path where the |
api_key |
Optional Anthropic API key. Defaults to
|
anthropic_version |
Anthropic API version string passed as the
|
This helper downloads that file and writes it to disk. It is the
Anthropic counterpart to openai_download_batch_output().
Invisibly, the output_path.
## Not run:
# Requires ANTHROPIC_API_KEY and network access.
final <- anthropic_poll_batch_until_complete(batch$id)
jsonl_path <- tempfile(fileext = ".jsonl")
anthropic_download_batch_results(final$id, jsonl_path)
## End(Not run)
This retrieves the latest state of a Message Batch using its id.
It corresponds to a GET request on
/v1/messages/batches/<MESSAGE_BATCH_ID>.
anthropic_get_batch(
  batch_id,
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  anthropic_version = "2023-06-01"
)
batch_id |
Character scalar giving the batch ID (for example
|
api_key |
Optional Anthropic API key. Defaults to
|
anthropic_version |
Anthropic API version string passed as the
|
A list representing the Message Batch object, including fields
such as id, processing_status, request_counts,
and (after completion) results_url.
## Not run:
# Requires ANTHROPIC_API_KEY and network access.
# After creating a batch:
batch <- anthropic_create_batch(requests = my_requests)
batch_id <- batch$id

latest <- anthropic_get_batch(batch_id)
latest$processing_status
## End(Not run)
This helper repeatedly calls anthropic_get_batch until
the batch's processing_status becomes "ended" or a time
limit is reached. It is analogous to
openai_poll_batch_until_complete() but for Anthropic's
Message Batches API.
anthropic_poll_batch_until_complete(
  batch_id,
  interval_seconds = 60,
  timeout_seconds = 86400,
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  anthropic_version = "2023-06-01",
  verbose = TRUE
)
batch_id |
Character scalar giving the batch ID. |
interval_seconds |
Polling interval in seconds. Defaults to 60. |
timeout_seconds |
Maximum total waiting time in seconds. Defaults to
24 hours ( |
api_key |
Optional Anthropic API key. Defaults to
|
anthropic_version |
Anthropic API version string passed as the
|
verbose |
Logical; if |
The final Message Batch object as returned by
anthropic_get_batch once processing_status == "ended"
or the last object retrieved before timing out.
## Not run:
# Requires ANTHROPIC_API_KEY and network access.
batch <- anthropic_create_batch(requests = my_requests)
final <- anthropic_poll_batch_until_complete(batch$id, interval_seconds = 30)
final$processing_status
## End(Not run)
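The polling behaviour described for this helper (repeat the status check until "ended" or until the time limit, then return the last object seen) can be sketched offline. Here `poll_until_ended()` and `fake_get_batch()` are hypothetical stand-ins, and the fake getter replaces the real API call so the loop runs without network access.

```r
# Sketch only: generic poll loop; get_batch is any function(batch_id) -> list.
poll_until_ended <- function(get_batch, batch_id,
                             interval_seconds = 60,
                             timeout_seconds = 86400,
                             verbose = TRUE) {
  started <- Sys.time()
  repeat {
    batch <- get_batch(batch_id)
    if (verbose) {
      message("Batch ", batch_id, ": ", batch$processing_status)
    }
    if (identical(batch$processing_status, "ended")) return(batch)

    # On timeout, return the last object retrieved rather than failing
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    if (elapsed >= timeout_seconds) return(batch)

    Sys.sleep(interval_seconds)
  }
}

# Fake getter: reports "in_progress" twice, then "ended"
calls <- 0
fake_get_batch <- function(batch_id) {
  calls <<- calls + 1
  list(
    id = batch_id,
    processing_status = if (calls < 3) "in_progress" else "ended"
  )
}

final_demo <- poll_until_ended(
  fake_get_batch, "msgbatch_demo",
  interval_seconds = 0, verbose = FALSE
)
final_demo$processing_status
```

Because a timeout returns the last retrieved object, callers should still check `processing_status` before downloading results.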
This helper converts a tibble of writing pairs into a list of Anthropic
Message Batch requests. Each request has a unique custom_id
of the form "ANTH_<ID1>_vs_<ID2>" and a params object
compatible with the /v1/messages API.
build_anthropic_batch_requests(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  reasoning = c("none", "enabled"),
  custom_id_prefix = "ANTH",
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
Anthropic Claude model name, for example
|
trait_name |
Short label for the trait (for example "Overall Quality"). |
trait_description |
Full-text description of the trait or rubric. |
prompt_template |
Prompt template string, typically from
|
reasoning |
Character scalar indicating whether to allow extended
thinking; one of |
custom_id_prefix |
Prefix for the |
... |
Additional Anthropic parameters such as |
The function mirrors the behaviour of
build_openai_batch_requests but targets Anthropic's
/v1/messages/batches endpoint. It applies the
same recommended defaults and reasoning constraints as
anthropic_compare_pair_live:
reasoning = "none":
Default temperature = 0 (deterministic behaviour),
unless you explicitly supply a different temperature via
....
Default max_tokens = 768, unless overridden via
max_tokens in ....
reasoning = "enabled" (extended thinking):
temperature must be 1. If you supply a different
value in ..., this function throws an error.
Defaults to max_tokens = 2048 and
thinking_budget_tokens = 1024, with the constraint
1024 <= thinking_budget_tokens < max_tokens. Violations of
this constraint produce an error.
As a result, when you build batches without extended thinking
(reasoning = "none"), the effective default temperature is 0. When
you opt into extended thinking (reasoning = "enabled"), Anthropic's
requirement of temperature = 1 is enforced for all batch requests.
A tibble with one row per pair and two main columns:
Character ID of the form
"<PREFIX>_<ID1>_vs_<ID2>".
List-column containing the Anthropic Messages API
params object for each request, ready to be used in the
requests array of /v1/messages/batches.
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 3, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Standard batch requests without extended thinking
reqs_none <- build_anthropic_batch_requests(
  pairs = pairs,
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  reasoning = "none"
)
reqs_none

# Batch requests with extended thinking
reqs_reason <- build_anthropic_batch_requests(
  pairs = pairs,
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  reasoning = "enabled"
)
reqs_reason
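For readers who want to see the shape of a single request, the following base-R sketch assembles custom_id and params pairs by hand. `build_requests_sketch()` is a hypothetical helper, the prompts are placeholder strings rather than filled templates, and the duplicate check is an assumption about how uniqueness might be enforced.

```r
# Sketch only: one list(custom_id, params) per pair, reasoning = "none" defaults.
build_requests_sketch <- function(pairs, model, prompts,
                                  prefix = "ANTH",
                                  temperature = 0,
                                  max_tokens = 768) {
  stopifnot(length(prompts) == nrow(pairs))

  custom_id <- paste0(prefix, "_", pairs$ID1, "_vs_", pairs$ID2)
  if (anyDuplicated(custom_id) > 0) {
    stop("custom_id values must be unique within a batch.")
  }

  lapply(seq_len(nrow(pairs)), function(i) {
    list(
      custom_id = custom_id[i],
      params = list(
        model = model,
        max_tokens = max_tokens,
        temperature = temperature,
        # Messages API body: a single user turn holding the filled prompt
        messages = list(list(role = "user", content = prompts[i]))
      )
    )
  })
}

pairs_demo <- data.frame(
  ID1 = c("S01", "S01"),
  ID2 = c("S02", "S03"),
  stringsAsFactors = FALSE
)

reqs_demo <- build_requests_sketch(
  pairs_demo,
  model = "claude-sonnet-4-5",
  prompts = c("Prompt for pair 1", "Prompt for pair 2")
)
vapply(reqs_demo, function(r) r$custom_id, character(1))
```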
This function converts pairwise comparison results into the three-column format commonly used for Bradley-Terry models: the first two columns contain object labels and the third column contains the comparison result (1 for a win of the first object, 0 for a win of the second).
build_bt_data(results)
results |
A data frame or tibble with either
|
It accepts either:
legacy columns ID1, ID2, better_id, or
canonical columns A_id, B_id, better_id.
Rows where better_id does not match either side of the pair
(including NA) are excluded.
A tibble with three columns:
object1: ID from ID1
object2: ID from ID2
result: numeric value, 1 if better_id == ID1,
0 if better_id == ID2
Rows with invalid or missing better_id are dropped.
results <- tibble::tibble(
  ID1 = c("S1", "S1", "S2"),
  ID2 = c("S2", "S3", "S3"),
  better_id = c("S1", "S3", "S2")
)

bt_data <- build_bt_data(results)
bt_data

# Using the example writing pairs
data("example_writing_pairs")
bt_ex <- build_bt_data(example_writing_pairs)
head(bt_ex)
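The conversion rule (keep rows whose better_id matches one side, then code a win for the first object as 1 and a win for the second as 0) can be written out in base R. `to_bt_rows()` is a hypothetical helper, and preferring A_id/B_id over ID1/ID2 when both are present is an assumption.

```r
# Sketch only: three-column Bradley-Terry data from pairwise results.
to_bt_rows <- function(results) {
  first  <- if ("A_id" %in% names(results)) results$A_id else results$ID1
  second <- if ("B_id" %in% names(results)) results$B_id else results$ID2
  better <- results$better_id

  # Drop rows where better_id is NA or matches neither side of the pair
  keep <- !is.na(better) & (better == first | better == second)

  data.frame(
    object1 = first[keep],
    object2 = second[keep],
    result  = as.numeric(better[keep] == first[keep]),
    stringsAsFactors = FALSE
  )
}

results_demo <- data.frame(
  ID1 = c("S1", "S1", "S2", "S2"),
  ID2 = c("S2", "S3", "S3", "S3"),
  better_id = c("S1", "S3", NA, "S9"),
  stringsAsFactors = FALSE
)

bt_demo <- to_bt_rows(results_demo)
bt_demo
```

In the toy data the third row (missing winner) and fourth row (winner not in the pair) are dropped, leaving two usable comparisons.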
results_tbl data for Bayesian BTL MCMC
Converts non-adaptive pairwise outcomes (for example, rows like
example_writing_pairs with ID1, ID2, better_id)
into the canonical results_tbl schema required by
fit_bayes_btl_mcmc().
build_btl_results_data(
  results,
  phase = "phase2",
  backend = "non_adaptive_import",
  model = "unknown",
  iter_start = 1L,
  received_at_start = as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
)
results |
A data frame or tibble containing columns |
phase |
Length-1 phase label for all rows. Must be one of
|
backend |
Length-1 backend label to record in output metadata. |
model |
Length-1 model label to record in output metadata. |
iter_start |
Integer starting value for |
received_at_start |
Length-1 |
The output is deterministic and schema-valid:
stable unordered_key / ordered_key values,
deterministic pair_uid as "<unordered_key>#<occurrence>",
deterministic iter and received_at sequences.
A tibble in canonical results_tbl format with columns:
pair_uid, unordered_key, ordered_key, A_id,
B_id, better_id, winner_pos, phase,
iter, received_at, backend, model.
data("example_writing_pairs", package = "pairwiseLLM")
results_tbl <- build_btl_results_data(example_writing_pairs)
head(results_tbl)

ids <- sort(unique(c(results_tbl$A_id, results_tbl$B_id)))
ids
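The deterministic keys can be illustrated with a base-R sketch. Only the "<unordered_key>#<occurrence>" pattern for pair_uid comes from the description above. The ":" separator inside the keys, the one-second spacing of received_at, and the helper name `make_pair_keys()` are assumptions made for illustration.

```r
# Sketch only: deterministic keys and sequences for imported results.
make_pair_keys <- function(A_id, B_id, iter_start = 1L,
                           received_at_start = as.POSIXct(
                             "1970-01-01 00:00:00", tz = "UTC"
                           )) {
  # Unordered key ignores presentation order; ordered key preserves it
  unordered_key <- paste(pmin(A_id, B_id), pmax(A_id, B_id), sep = ":")
  ordered_key   <- paste(A_id, B_id, sep = ":")

  # Occurrence counter within each unordered pair, in row order
  occurrence <- ave(seq_along(unordered_key), unordered_key, FUN = seq_along)

  offset <- seq_along(A_id) - 1L
  data.frame(
    pair_uid      = paste0(unordered_key, "#", occurrence),
    unordered_key = unordered_key,
    ordered_key   = ordered_key,
    iter          = as.integer(iter_start) + offset,
    received_at   = received_at_start + offset,
    stringsAsFactors = FALSE
  )
}

keys_demo <- make_pair_keys(
  A_id = c("S1", "S2", "S1"),
  B_id = c("S2", "S1", "S3")
)
keys_demo[, c("pair_uid", "ordered_key", "iter")]
```

The first two rows share an unordered key because they compare the same two samples in opposite orders, so they receive occurrences 1 and 2.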
This function converts pairwise comparison results into the two-column format used by the EloChoice package: one column for the winner and one for the loser of each trial.
build_elo_data(results)
results |
A data frame or tibble with either
|
It accepts either:
legacy columns ID1, ID2, better_id, or
canonical columns A_id, B_id, better_id.
Rows where better_id does not match either side of the pair
(including NA) are excluded.
A tibble with two columns:
winner: ID of the winning sample
loser: ID of the losing sample
Rows with invalid or missing better_id are dropped.
results <- tibble::tibble(
  ID1 = c("S1", "S1", "S2", "S3"),
  ID2 = c("S2", "S3", "S3", "S4"),
  better_id = c("S1", "S3", "S2", "S4")
)

elo_data <- build_elo_data(results)
elo_data
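The winner/loser layout follows directly from better_id, as the base-R sketch below shows. `to_elo_rows()` is a hypothetical helper that handles only the legacy ID1/ID2 columns, and it is not the package's implementation.

```r
# Sketch only: winner/loser pairs from ID1, ID2, better_id.
to_elo_rows <- function(results) {
  better <- results$better_id
  keep <- !is.na(better) & (better == results$ID1 | better == results$ID2)

  id1 <- results$ID1[keep]
  id2 <- results$ID2[keep]
  winner <- better[keep]

  data.frame(
    winner = winner,
    # The loser is whichever side of the pair did not win
    loser = ifelse(winner == id1, id2, id1),
    stringsAsFactors = FALSE
  )
}

elo_demo <- to_elo_rows(data.frame(
  ID1 = c("S1", "S1", "S2"),
  ID2 = c("S2", "S3", "S3"),
  better_id = c("S1", "S3", NA),
  stringsAsFactors = FALSE
))
elo_demo
```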
This helper converts a tibble of writing pairs into a set of Gemini
GenerateContent requests suitable for use with the Batch API
(models/*:batchGenerateContent).
build_gemini_batch_requests(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  thinking_level = "low",
  custom_id_prefix = "GEM",
  temperature = NULL,
  top_p = NULL,
  top_k = NULL,
  max_output_tokens = NULL,
  include_thoughts = FALSE,
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
Gemini model name, for example |
trait_name |
Short label for the trait (for example "Overall Quality"). |
trait_description |
Full-text description of the trait or rubric. |
prompt_template |
Prompt template string, typically from
|
thinking_level |
One of This is mapped to Gemini's
|
custom_id_prefix |
Prefix for the |
temperature |
Optional numeric temperature. If |
top_p |
Optional nucleus sampling parameter. If |
top_k |
Optional top-k sampling parameter. If |
max_output_tokens |
Optional integer. If |
include_thoughts |
Logical; if |
... |
Reserved for future extensions. Any |
Each pair receives a unique custom_id of the form
"GEM_<ID1>_vs_<ID2>" and a corresponding request object containing
the prompt and generation configuration.
A tibble with one row per pair and two main columns:
Character ID of the form
"<PREFIX>_<ID1>_vs_<ID2>".
List-column containing the Gemini GenerateContent request object for each pair.
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 3, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Gemini 3 Pro example (existing behavior)
reqs <- build_gemini_batch_requests(
  pairs = pairs,
  model = "gemini-3-pro-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  thinking_level = "low",
  include_thoughts = TRUE
)
reqs

# Gemini 3 Flash example (minimal thinking)
reqs_flash <- build_gemini_batch_requests(
  pairs = pairs,
  model = "gemini-3-flash-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  thinking_level = "minimal",
  include_thoughts = FALSE
)
reqs_flash
This helper constructs one JSON object per pair of writing samples,
suitable for use with the OpenAI batch API. It supports both
/v1/chat/completions and /v1/responses endpoints.
build_openai_batch_requests(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  endpoint = c("chat.completions", "responses"),
  temperature = NULL,
  top_p = NULL,
  logprobs = NULL,
  reasoning = NULL,
  include_thoughts = FALSE,
  request_id_prefix = "EXP"
)
pairs |
A data frame or tibble with columns |
model |
Character scalar giving the OpenAI model name.
Supports standard names (e.g. |
trait_name |
Short label for the trait (e.g., "Overall Quality"). |
trait_description |
Full-text definition of the trait. |
prompt_template |
Character template containing the placeholders
|
endpoint |
Which OpenAI endpoint to target. One of
|
temperature |
Optional temperature parameter. Defaults to |
top_p |
Optional top_p parameter. |
logprobs |
Optional logprobs parameter. |
reasoning |
Optional reasoning effort for GPT-5 series when using
the |
include_thoughts |
Logical; if TRUE and using |
request_id_prefix |
String prefix for |
A tibble with one row per pair and columns:
custom_id: ID string used by the batch API.
method: HTTP method ("POST").
url: Endpoint path ("/v1/chat/completions" or
"/v1/responses").
body: List column containing the request body.
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 3, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# 1. Basic chat.completions batch with no thoughts
batch_tbl_chat <- build_openai_batch_requests(
  pairs = pairs,
  model = "gpt-4.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  endpoint = "chat.completions",
  temperature = 0
)

# 2. GPT-5.2-2025-12-11 Responses Batch with Reasoning
batch_tbl_resp <- build_openai_batch_requests(
  pairs = pairs,
  model = "gpt-5.2-2025-12-11",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  endpoint = "responses",
  include_thoughts = TRUE, # implies reasoning="low" if not set
  reasoning = "medium"
)

batch_tbl_chat
batch_tbl_resp
This function takes a prompt template (typically from
set_prompt_template), a trait name and description,
and two writing samples, and fills in the required placeholders.
build_prompt(template, trait_name, trait_desc, text1, text2)
template |
Character string containing the prompt template. |
trait_name |
Character scalar giving a short label for the trait (e.g., "Overall Quality"). |
trait_desc |
Character scalar giving the full definition of the trait. |
text1 |
Character scalar containing the text for SAMPLE_1. |
text2 |
Character scalar containing the text for SAMPLE_2. |
The template must contain the placeholders:
{TRAIT_NAME}, {TRAIT_DESCRIPTION},
{SAMPLE_1}, and {SAMPLE_2}.
A single character string containing the completed prompt.
tmpl <- set_prompt_template()
td <- trait_description("overall_quality")

prompt <- build_prompt(
  template = tmpl,
  trait_name = td$name,
  trait_desc = td$description,
  text1 = "This is sample 1.",
  text2 = "This is sample 2."
)

cat(substr(prompt, 1, 200), "...\n")
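Placeholder filling of this kind can be sketched in base R with literal string replacement. `fill_template()` is a hypothetical helper rather than the package's code, and the toy template below is much shorter than the one returned by set_prompt_template().

```r
# Sketch only: check for the four placeholders, then substitute literally.
fill_template <- function(template, trait_name, trait_desc, text1, text2) {
  values <- c(
    "{TRAIT_NAME}"        = trait_name,
    "{TRAIT_DESCRIPTION}" = trait_desc,
    "{SAMPLE_1}"          = text1,
    "{SAMPLE_2}"          = text2
  )

  # fixed = TRUE treats the braces as literal characters, not regex syntax
  present <- vapply(names(values), grepl, logical(1),
                    x = template, fixed = TRUE)
  if (!all(present)) {
    stop("Template is missing: ",
         paste(names(values)[!present], collapse = ", "))
  }

  out <- template
  for (placeholder in names(values)) {
    out <- gsub(placeholder, values[[placeholder]], out, fixed = TRUE)
  }
  out
}

toy_template <- paste0(
  "Trait: {TRAIT_NAME} ({TRAIT_DESCRIPTION})\n",
  "A: {SAMPLE_1}\n",
  "B: {SAMPLE_2}"
)

prompt_demo <- fill_template(
  toy_template, "Overall Quality", "How good the writing is",
  "First text.", "Second text."
)
cat(prompt_demo, "\n")
```

One caveat of sequential substitution is that a sample containing a literal placeholder such as {SAMPLE_2} would itself be substituted in a later pass.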
This function inspects the current R session for configured API keys
used by pairwiseLLM. It checks for known environment variables such as
OPENAI_API_KEY, ANTHROPIC_API_KEY, and GEMINI_API_KEY, and returns
a small tibble summarising which keys are available.
check_llm_api_keys(verbose = TRUE)
verbose |
Logical; if |
It does not print or return the key values themselves - only whether each key is present. This makes it safe to run in logs, scripts, and shared environments.
A tibble (data frame) with one row per backend and columns:
Short backend identifier, e.g. "openai", "anthropic",
"gemini", "together".
Human-readable service name, e.g. "OpenAI",
"Anthropic", "Google Gemini", "Together.ai".
Name of the environment variable that is checked.
Logical flag indicating whether the key is set and non-empty.
# In an interactive session, quickly check which keys are configured:
check_llm_api_keys()

# In non-interactive scripts, you can disable messages and just use the
# result:
status <- check_llm_api_keys(verbose = FALSE)
status
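A presence-only check of this kind can be sketched with Sys.getenv() and nzchar(). The helper `key_status()` and its column names (`backend`, `env_var`, `has_key`) are assumptions for illustration, and only the three environment variables named above are included.

```r
# Sketch only: report whether each variable is set, never its value.
key_status <- function(env_vars = c(
                         openai    = "OPENAI_API_KEY",
                         anthropic = "ANTHROPIC_API_KEY",
                         gemini    = "GEMINI_API_KEY"
                       )) {
  # Unset variables come back as "", so nzchar() is FALSE for them
  values <- Sys.getenv(unname(env_vars), unset = "")
  data.frame(
    backend = names(env_vars),
    env_var = unname(env_vars),
    has_key = unname(nzchar(values)),
    stringsAsFactors = FALSE
  )
}

status_demo <- key_status()
status_demo
```

Since only a logical flag is stored, the resulting table can be printed or logged without exposing any secret.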
This function diagnoses positional bias in LLM-based paired comparison data and provides a bootstrapped confidence interval for the overall consistency of forward vs. reverse comparisons.
check_positional_bias(
  consistency,
  n_boot = 1000,
  conf_level = 0.95,
  seed = NULL
)
consistency |
Either:
|
n_boot |
Integer, number of bootstrap resamples for estimating the distribution of the overall consistency proportion. Default is 1000. |
conf_level |
Confidence level for the bootstrap interval. Default is 0.95. |
seed |
Optional integer seed for reproducible bootstrapping. If
|
It is designed to work with the output of
compute_reverse_consistency, but will also accept a tibble
that looks like its $details component.
A list with two elements:
A tibble with:
n_pairs: number of unordered pairs
prop_consistent: observed proportion of consistent pairs
boot_mean: mean of bootstrap consistency proportions
boot_lwr, boot_upr: bootstrap confidence interval
p_sample1_main: p-value from a binomial test for the
null hypothesis that SAMPLE_1 wins 50% of
main (forward) comparisons
p_sample1_rev: analogous p-value for the reverse
comparisons
p_sample1_overall: p-value from a binomial test for
the null hypothesis that position 1 wins 50% of
all (forward + reverse) comparisons
total_pos1_wins: total number of wins by position 1
across forward + reverse comparisons
total_comparisons: total number of valid forward +
reverse comparisons included in the overall test
n_inconsistent: number of pairs with inconsistent
forward vs. reverse outcomes
n_inconsistent_pos1_bias: among inconsistent pairs, how
many times the winner is in position 1 in both directions
n_inconsistent_pos2_bias: analogous for position 2
The input details tibble augmented with:
winner_pos_main: "pos1" or "pos2" (or
NA) indicating which position won in the main direction
winner_pos_rev: analogous for the reversed direction
is_pos1_bias: logical; TRUE if the pair is
inconsistent and position 1 wins in both directions
is_pos2_bias: analogous for position 2
# Simple synthetic example
main <- tibble::tibble(
  ID1 = c("S1", "S1", "S2"),
  ID2 = c("S2", "S3", "S3"),
  better_id = c("S1", "S3", "S2")
)

rev <- tibble::tibble(
  ID1 = c("S2", "S3", "S3"),
  ID2 = c("S1", "S1", "S2"),
  better_id = c("S1", "S3", "S2")
)

rc <- compute_reverse_consistency(main, rev)
rc$summary

bias <- check_positional_bias(rc)
bias$summary
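The two statistical ingredients named in the summary (a bootstrap interval for the consistency proportion and a binomial test for position-1 wins) can be sketched with base R alone. The data below are invented for illustration, and the percentile interval is an assumption; the package may construct its interval differently.

```r
set.seed(1)

# Invented per-pair consistency flags (TRUE = same winner in both directions)
is_consistent <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Bootstrap the consistency proportion (percentile interval; an assumption)
n_boot <- 1000
boot_props <- replicate(n_boot, mean(sample(is_consistent, replace = TRUE)))
boot_ci <- quantile(boot_props, probs = c(0.025, 0.975))

# Binomial test of the null that position 1 wins 50% of comparisons
pos1_wins <- 11
total_comparisons <- 16
p_overall <- binom.test(pos1_wins, total_comparisons, p = 0.5)$p.value

mean(is_consistent)
boot_ci
p_overall
```

A small p-value here would suggest that the first-listed sample wins more (or less) often than chance alone would predict.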
Given two data frames of pairwise comparison results (one for the "forward" ordering of pairs, one for the "reverse" ordering), this function identifies unordered pairs that were evaluated in both directions and computes the proportion of consistent judgments.
compute_reverse_consistency(main_results, reverse_results)
main_results |
A data frame or tibble containing pairwise
comparison results for the "forward" ordering of pairs, with
columns |
reverse_results |
A data frame or tibble containing results for the corresponding "reverse" ordering, with the same column requirements. |
Consistency is defined at the level of IDs: a pair is consistent
if the same ID is selected as better in both directions. This function
assumes each input contains columns ID1, ID2, and
better_id, where better_id is the ID of the better sample
(not "SAMPLE_1"/"SAMPLE_2").
Per-key majority agreement (duplicates supported).
If a pair appears multiple times in main_results and/or
reverse_results (e.g., submitted twice), this function aggregates
each unordered pair key separately in each direction and takes the
majority better_id. If there is a tie for the majority
winner within a direction, that direction's majority winner is set to
NA and the key is excluded from the consistency calculation.
The output details contains exactly one row per unordered pair key,
which keeps it compatible with check_positional_bias.
A list with two elements:
summary: a tibble with one row and columns
n_pairs, n_consistent, and prop_consistent.
Here, n_pairs counts unordered pair keys with a non-missing
majority winner in both directions.
details: a tibble with one row per unordered pair key,
including columns key, ID1_main, ID2_main,
ID1_rev, ID2_rev, better_id_main,
better_id_rev, and is_consistent. Additional columns
provide vote counts and tie flags.
main <- tibble::tibble(
  ID1 = c("A", "A", "X"),
  ID2 = c("B", "B", "Y"),
  better_id = c("A", "B", "X") # duplicate A-B with disagreement
)
rev <- tibble::tibble(
  ID1 = c("B"),
  ID2 = c("A"),
  better_id = c("A")
)
compute_reverse_consistency(main, rev)$summary
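The per-key majority rule described above can be sketched in base R. The helpers `unordered_key()` and `majority_winner()` are hypothetical names used only for this illustration; the sketch mirrors the documented behaviour (ties become NA and are excluded) but is not the package's implementation.

```r
# Hypothetical helper: order-independent key for a pair of IDs
unordered_key <- function(id1, id2) {
  paste(pmin(id1, id2), pmax(id1, id2), sep = ":")
}

# Hypothetical helper: majority vote, NA when the top count is tied
majority_winner <- function(votes) {
  counts <- table(votes)
  top <- names(counts)[counts == max(counts)]
  if (length(top) == 1L) top else NA_character_
}

main <- data.frame(
  ID1 = c("A", "A", "A", "X"),
  ID2 = c("B", "B", "B", "Y"),
  better_id = c("A", "A", "B", "X"), # A-B judged three times: A wins 2-1
  stringsAsFactors = FALSE
)
rev <- data.frame(
  ID1 = c("B", "Y"),
  ID2 = c("A", "X"),
  better_id = c("A", "Y"),
  stringsAsFactors = FALSE
)

main_major <- tapply(main$better_id, unordered_key(main$ID1, main$ID2), majority_winner)
rev_major <- tapply(rev$better_id, unordered_key(rev$ID1, rev$ID2), majority_winner)

# Compare only keys judged in both directions; NA majorities drop out
keys <- intersect(names(main_major), names(rev_major))
is_consistent <- main_major[keys] == rev_major[keys]
n_pairs <- sum(!is.na(is_consistent))
prop_consistent <- mean(is_consistent, na.rm = TRUE)
```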
ensure_only_ollama_model_loaded() is a small convenience helper for
managing memory when working with large local models via Ollama. It
inspects the current set of active models using the ollama ps command
and attempts to unload any models that are not the one you specify.
ensure_only_ollama_model_loaded(model, verbose = TRUE)
model |
Character scalar giving the Ollama model name that should
remain loaded (for example |
verbose |
Logical; if |
This can be useful when running multiple large models (for example
"mistral-small3.2:24b", "qwen3:32b", "gemma3:27b") on a single
machine, where keeping all of them loaded simultaneously may exhaust
GPU or system memory.
The function is intentionally conservative:
If the ollama command is not available on the system or
ollama ps returns an error or empty output, no action is taken
and a message is printed when verbose = TRUE.
If no active models are reported, no action is taken.
Only models with names different from model are passed to
ollama stop <name>.
This helper is not called automatically by the package; it is intended
to be used programmatically in development scripts and ad hoc workflows
before running comparisons with ollama_compare_pair_live() or
submit_ollama_pairs_live().
This function relies on the ollama command-line interface being
available on the system PATH. If the command cannot be executed
or returns a non-zero status code, the function will issue a message
(when verbose = TRUE) and return without making any changes.
The exact output format of ollama ps is treated as an
implementation detail: this helper assumes that the first non-empty line
is a header and that subsequent non-empty lines begin with the model
name as the first whitespace-separated field. If the format changes in a
future version of Ollama, parsing may fail and the function will simply
fall back to doing nothing.
Because ollama stop affects the global Ollama server state for the
current machine, you should only use this helper in environments where
you are comfortable unloading models that might be in use by other
processes.
Invisibly returns a character vector containing the names of
models that were requested to be unloaded (i.e., those passed to
ollama stop). If no models were unloaded, an empty character
vector is returned.
ollama_compare_pair_live() for single-pair Ollama comparisons.
submit_ollama_pairs_live() for row-wise Ollama comparisons across
many pairs.
## Not run: 
# Keep only mistral-small3.2:24b loaded in Ollama, unloading any
# other active models
ensure_only_ollama_model_loaded("mistral-small3.2:24b")

## End(Not run)
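The parsing assumption described above (a header line followed by lines whose first whitespace-separated field is the model name) can be sketched without calling Ollama at all. The `ps_output` text below is made up for illustration, and `parse_ollama_ps()` is a hypothetical helper, not a function exported by the package.

```r
# Hypothetical helper: extract model names from `ollama ps`-style output
parse_ollama_ps <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]        # drop empty lines
  if (length(lines) <= 1L) return(character(0)) # header only, or nothing
  fields <- strsplit(trimws(lines[-1]), "\\s+")
  vapply(fields, `[`, character(1), 1L)         # first field = model name
}

# Illustrative output only; real column layout may differ
ps_output <- c(
  "NAME          ID       SIZE    PROCESSOR   UNTIL",
  "qwen3:32b     abc123   20 GB   100% GPU    4 minutes from now",
  "",
  "gemma3:27b    def456   17 GB   100% GPU    4 minutes from now"
)

active <- parse_ollama_ps(ps_output)
to_unload <- setdiff(active, "qwen3:32b") # models that would be stopped
to_unload
```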
Estimate total token usage and cost for running a large set of pairwise comparisons by:
running a small pilot on n_test pairs (live calls) to observe
prompt_tokens and completion_tokens,
using the pilot to calibrate a prompt-bytes-to-input-token model for the remaining pairs, and
prorating output tokens for the remaining pairs from the pilot distribution.
estimate_llm_pairs_cost(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  backend = c("openai", "anthropic", "gemini", "together"),
  endpoint = c("chat.completions", "responses"),
  mode = c("live", "batch"),
  n_test = 25,
  test_strategy = c("stratified_prompt_bytes", "random", "first"),
  seed = NULL,
  cost_per_million_input,
  cost_per_million_output,
  batch_discount = 1,
  budget_quantile = 0.9,
  return_test_results = TRUE,
  return_remaining_pairs = TRUE,
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
Model name to use for the pilot run (and for the target job). |
trait_name |
Short label for the trait (for example "Overall Quality"). |
trait_description |
Full-text description of the trait or rubric. |
prompt_template |
Prompt template string, typically from
|
backend |
Backend for the pilot run; one of |
endpoint |
OpenAI endpoint; one of |
mode |
Target execution mode for the full job; one of |
n_test |
Number of pilot pairs to run live. Defaults to 25, or to the number of supplied pairs if that is smaller. |
test_strategy |
Strategy for selecting pilot pairs:
|
seed |
Optional integer seed used for pilot sampling when
|
cost_per_million_input |
Cost per one million input tokens (prompt tokens), in your currency of choice. |
cost_per_million_output |
Cost per one million output tokens (completion tokens). Reasoning/thinking tokens are treated as output. |
batch_discount |
Numeric scalar multiplier applied to the estimated cost
for the remaining pairs when |
budget_quantile |
Quantile used for the "budget" output-token estimate
for remaining pairs. Defaults to |
return_test_results |
Logical; if |
return_remaining_pairs |
Logical; if |
... |
Additional arguments forwarded to |
The estimator does not require a provider tokenizer.
Input tokens are estimated from the byte length of the fully constructed
prompt and calibrated on the pilot's observed prompt_tokens.
An object of class "pairwiseLLM_cost_estimate", a list with:
A one-row tibble with expected and budget token and cost estimates (and pilot usage).
A list describing the input-token calibration (coefficients and fit diagnostics).
The pilot pair subset.
Pilot results (when return_test_results = TRUE).
Remaining pairs (when
return_remaining_pairs = TRUE).
## Not run: 
# Requires an API key and internet access.
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 50, seed = 123)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

est <- estimate_llm_pairs_cost(
  pairs = pairs,
  backend = "openai",
  model = "gpt-4.1",
  endpoint = "chat.completions",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  mode = "batch",
  batch_discount = 0.5,
  n_test = 10,
  cost_per_million_input = 0.15,
  cost_per_million_output = 0.60
)

est
est$summary

# Reuse pilot results and run only remaining pairs:
remaining <- est$remaining_pairs

## End(Not run)
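To make the estimation logic concrete, here is a rough base-R sketch of the same idea with invented pilot numbers. It assumes a simple linear regression of prompt tokens on prompt bytes; the package's actual calibration model may take a different form, and the `cost()` helper is hypothetical.

```r
# Invented pilot observations (bytes of the built prompt, tokens reported)
pilot <- data.frame(
  prompt_bytes = c(2100, 2400, 2650, 3000, 3300),
  prompt_tokens = c(520, 598, 655, 748, 821),
  completion_tokens = c(12, 15, 11, 40, 14)
)
remaining_bytes <- c(2200, 2800, 3100) # prompt sizes of the pairs not yet run

# Calibrate bytes -> input tokens (linear form is an assumption)
calib <- lm(prompt_tokens ~ prompt_bytes, data = pilot)
est_input <- sum(predict(calib, newdata = data.frame(prompt_bytes = remaining_bytes)))

# Prorate output tokens: mean for "expected", upper quantile for "budget"
n_remaining <- length(remaining_bytes)
expected_output <- n_remaining * mean(pilot$completion_tokens)
budget_output <- n_remaining * quantile(pilot$completion_tokens, 0.9, names = FALSE)

# Hypothetical helper: per-million pricing with an optional batch multiplier
cost <- function(input_tokens, output_tokens, in_rate, out_rate, discount = 1) {
  discount * (input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate)
}

expected_cost <- cost(est_input, expected_output, 0.15, 0.60, discount = 0.5)
budget_cost <- cost(est_input, budget_output, 0.15, 0.60, discount = 0.5)
c(expected = expected_cost, budget = budget_cost)
```

Using an upper quantile for the budget figure guards against occasional long completions, such as the 40-token outlier in the invented pilot data.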
A small character vector containing three example lines from an OpenAI Batch API output file in JSONL format. Each element is a single JSON object representing the result for one batch request.
data("example_openai_batch_output")
A character vector of length 3, where each element is a single JSON line (JSONL).
The structure follows the current Batch API output schema, with
fields such as id, custom_id, and a nested
response object containing status_code,
request_id, and a body that resembles a regular
chat completion response. One line illustrates a successful
comparison where <BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>
is returned, one illustrates a case where SAMPLE_2 is preferred,
and one illustrates an error case with a non-200 status.
This dataset is designed for use in examples and tests of batch output parsing functions. Typical usage is to write the lines to a temporary file and then read/parse them as a JSONL batch file.
data("example_openai_batch_output")

# Inspect the first line
cat(example_openai_batch_output[1], "\n")

# Write to a temporary .jsonl file for parsing
tmp <- tempfile(fileext = ".jsonl")
writeLines(example_openai_batch_output, con = tmp)
tmp
A complete set of unordered paired comparison outcomes for the
20 samples in example_writing_samples. For each
pair of IDs, the better_id field indicates which sample
is assumed to be better, based on the quality_score in
example_writing_samples.
data("example_writing_pairs")
A tibble with 190 rows and 3 variables:
Character ID of the first sample in the pair.
Character ID of the second sample in the pair.
Character ID of the sample judged better in
this pair (either ID1 or ID2).
This dataset is useful for demonstrating functions that process
paired comparisons (e.g., building Bradley-Terry data and
fitting btm models) without requiring any
calls to an LLM.
data("example_writing_pairs")
head(example_writing_pairs)
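The row count follows from the number of unordered pairs among 20 samples, choose(20, 2) = 190. A small base-R sketch of how such a table could be derived is shown below; the strictly increasing `score` vector is hypothetical (chosen to avoid ties) and is not the dataset's actual quality_score.

```r
ids <- sprintf("S%02d", 1:20)

# Hypothetical tie-free scores, used only for this illustration
score <- setNames(seq_along(ids), ids)

# All unordered pairs: one row per combination of two IDs
pair_mat <- t(combn(ids, 2))
pairs <- data.frame(
  ID1 = pair_mat[, 1],
  ID2 = pair_mat[, 2],
  stringsAsFactors = FALSE
)

# The higher-scoring sample is recorded as the better one
pairs$better_id <- ifelse(
  unname(score[pairs$ID1]) >= unname(score[pairs$ID2]),
  pairs$ID1,
  pairs$ID2
)

nrow(pairs)
head(pairs)
```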
Canonical results_tbl representation of
example_writing_pairs, intended for direct use with
fit_bayes_btl_mcmc and other functions that require adaptive
schema-compatible results input.
data("example_writing_results")
A tibble with 190 rows and 12 variables:
Deterministic pair attempt ID.
Unordered pair key ("min:max").
Ordered pair key ("A_id:B_id").
Character ID in first position.
Character ID in second position.
Character ID judged better in this comparison.
Integer winner position (1L or 2L).
Phase label.
Integer step index.
POSIXct timestamp in UTC.
Backend label for provenance.
Model label for provenance.
data("example_writing_results")
head(example_writing_results)
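As a rough illustration of how the key and position fields relate to a plain ID1/ID2/better_id table, the following base-R sketch derives a few of them. Column names follow the fit_bayes_btl_mcmc() example in this manual; the derivation itself is an assumption about the schema, not the package's own conversion code.

```r
pairs <- data.frame(
  ID1 = c("S02", "S01"),
  ID2 = c("S01", "S03"),
  better_id = c("S02", "S03"),
  stringsAsFactors = FALSE
)

results_like <- data.frame(
  # "min:max" key: identical regardless of presentation order
  unordered_key = paste(pmin(pairs$ID1, pairs$ID2), pmax(pairs$ID1, pairs$ID2), sep = ":"),
  # "A_id:B_id" key: preserves presentation order
  ordered_key = paste(pairs$ID1, pairs$ID2, sep = ":"),
  A_id = pairs$ID1,
  B_id = pairs$ID2,
  better_id = pairs$better_id,
  # 1L if the first-position sample won, 2L otherwise
  winner_pos = ifelse(pairs$better_id == pairs$ID1, 1L, 2L),
  stringsAsFactors = FALSE
)
results_like
```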
A small set of 20 writing samples on the topic "Why is writing assessment difficult?", intended for use in examples and tests involving pairing and LLM-based comparisons. The samples vary in quality, approximately from very weak to very strong, and a simple numeric quality score is included to support simulated comparison outcomes.
data("example_writing_samples")
A tibble with 20 rows and 3 variables:
Character ID for each sample (e.g., "S01").
Character string with the writing sample.
Integer from 1 to 10 indicating the intended relative quality of the sample (higher = better).
data("example_writing_samples")
example_writing_samples
A synthetic dataset of 1,000 short writing samples generated by a large language model for use in pairwise comparison and ranking experiments.
data("example_writing_samples1000")
A tibble with 1,000 rows and 7 variables:
Character. Unique sample identifier (S0001–S1000).
Character. The writing sample (approximately 120–180 words).
Integer. Intended quality level used during generation (1–20).
Numeric. Centered latent-quality proxy derived from
quality_level.
Character. Identifier for the generation prompt template.
Character. Language model used to generate the samples.
POSIXct. Timestamp (UTC) when the samples were generated.
Samples are generated in 20 discrete quality levels (1 = lowest, 20 = highest), with multiple responses per level. Quality levels are intended to represent overlapping ranges of overall writing quality rather than a strict total ordering, allowing for realistic noise and near-ties in pairwise judgments.
All samples respond to the same writing prompt to avoid topic effects. The dataset is primarily intended for benchmarking ranking models and for comparing random versus adaptive pair selection strategies under limited judgment budgets.
The column theta_true provides a centered numeric proxy for the latent
quality dimension derived from quality_level. This proxy is intended
for evaluation purposes (e.g., rank recovery or correlation) and does not
imply a perfectly ordered ground truth at the individual-sample level.
Generated via live OpenAI API calls using a controlled, bucketed quality prompt.
See data-raw/generate_example_writing_samples1000.R for details.
data(example_writing_samples1000)
head(example_writing_samples1000)
Runs full Bayesian posterior inference for a Bradley–Terry–Luce (BTL) style
model using the package’s CmdStan machinery, but in a standalone
(non-adaptive) context. The function is designed so downstream diagnostics
and reporting can reuse the existing adaptive summary tools (notably
summarize_items() and summarize_refits()) without requiring new summary
functions.
fit_bayes_btl_mcmc(
  results,
  ids,
  model_variant = "btl_e_b",
  cmdstan = list(iter_warmup = 1000, iter_sampling = 1000, seed = NULL, core_fraction = 0.8),
  pair_counts = NULL,
  subset_method = c("first", "sample"),
  seed = NULL
)
results |
Canonical |
ids |
Character vector of all sample ids (length |
model_variant |
Model variant label: |
cmdstan |
List of CmdStan settings. Common fields:
|
pair_counts |
Optional integer vector of subset sizes (e.g.,
|
subset_method |
Subset strategy when |
seed |
Optional integer seed for deterministic subset selection when
|
Internally, the function can optionally refit the model on increasing subsets
of the observed comparisons (via pair_counts). Each refit is treated
as a "refit" in the adaptive logging sense, producing:
one round-log row per refit (compatible with round_log_schema()),
one item-log table per refit (compatible with .adaptive_item_log_schema()).
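The idea of refitting on increasing subsets can be sketched in base R as follows. The helper `subset_results()` is hypothetical, and the interpretation of the "first" and "sample" strategies (first n rows versus a random n rows) is an assumption based on the argument names; the model fit itself is omitted.

```r
# Hypothetical helper: pick n comparisons by one of two strategies
subset_results <- function(results, n, method = c("first", "sample"), seed = NULL) {
  method <- match.arg(method)
  n <- min(n, nrow(results))
  if (method == "first") {
    return(results[seq_len(n), , drop = FALSE])
  }
  if (!is.null(seed)) set.seed(seed)
  results[sort(sample.int(nrow(results), n)), , drop = FALSE]
}

toy_results <- data.frame(pair_uid = sprintf("p%02d", 1:10), stringsAsFactors = FALSE)
pair_counts <- c(3, 6, 10)

# One subset per requested size; each would feed one refit
subsets <- lapply(pair_counts, function(n) subset_results(toy_results, n, "first"))
vapply(subsets, nrow, integer(1))
```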
A list with:
List of item-log tables, one per refit, matching the
canonical adaptive item log schema. This is the preferred structure for
reuse with summarize_items().
A single tibble formed by row-binding item_log_list
(kept for backward compatibility). Each row corresponds to an item within
a refit; refit_id identifies the refit.
Tibble matching the canonical adaptive round log schema (one row per refit).
List of BTL fit contracts (one per refit).
Single fit contract (only when one refit is run).
## Not run: 
results <- tibble::tibble(
  pair_uid = "A:B#1",
  unordered_key = "A:B",
  ordered_key = "A:B",
  A_id = "A",
  B_id = "B",
  better_id = "A",
  winner_pos = 1L,
  phase = "phase2",
  iter = 1L,
  received_at = as.POSIXct("2026-01-01 00:00:00", tz = "UTC"),
  backend = "openai",
  model = "gpt-test"
)

fit <- fit_bayes_btl_mcmc(
  results,
  ids = c("A", "B"),
  model_variant = "btl_e_b"
)

# Generate summaries
summarize_refits(fit)
summarize_items(fit)

## End(Not run)
This function fits a Bradley–Terry paired-comparison model to data
prepared by build_bt_data. It supports two modeling
engines:
sirt: btm — the preferred engine, which
produces ability estimates, standard errors, and MLE reliability.
BradleyTerry2: BTm — used as a
fallback if sirt is unavailable or fails; computes ability
estimates and standard errors, but not reliability.
fit_bt_model(
  bt_data,
  engine = c("auto", "sirt", "BradleyTerry2"),
  verbose = TRUE,
  ...
)
bt_data |
A data frame or tibble with exactly three columns:
two character ID columns and one numeric |
engine |
Character string specifying the modeling engine. One of:
|
verbose |
Logical. If |
... |
Additional arguments passed through to |
When engine = "auto" (the default), the function attempts
sirt first and automatically falls back to BradleyTerry2
only if necessary. In all cases, the output format is standardized, so
downstream code can rely on consistent fields.
The input bt_data must contain exactly three columns:
object1: character ID for the first item in the pair
object2: character ID for the second item
result: numeric indicator (1 = object1 wins, 0 = object2 wins)
Ability estimates (theta) represent latent "writing quality"
parameters on a log-odds scale. Standard errors are included for both
modeling engines. MLE reliability is only available from sirt.
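Because theta is on a log-odds scale, the difference between two abilities maps to a win probability through the logistic function, which is the standard Bradley-Terry relationship. The helper name `bt_win_prob()` below is hypothetical and used only for illustration.

```r
# Probability that item i beats item j under a Bradley-Terry model,
# given abilities on the log-odds scale
bt_win_prob <- function(theta_i, theta_j) {
  plogis(theta_i - theta_j)
}

bt_win_prob(1.0, 0.0)  # a one-logit advantage: about 0.73
bt_win_prob(0.5, 0.5)  # equal abilities: 0.5
```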
A list with the following elements:
The engine actually used ("sirt" or "BradleyTerry2").
The fitted model object.
A tibble with columns:
ID: object identifier
theta: estimated ability parameter
se: standard error of theta
MLE reliability (sirt engine only). NA for
BradleyTerry2 models.
# Example using built-in comparison data
data("example_writing_pairs")
bt <- build_bt_data(example_writing_pairs)

fit1 <- fit_bt_model(bt, engine = "sirt")
fit2 <- fit_bt_model(bt, engine = "BradleyTerry2")
This function fits an Elo-based paired-comparison model using the
EloChoice package. It is intended to complement
fit_bt_model by providing an alternative scoring framework
based on Elo ratings rather than Bradley–Terry models.
fit_elo_model(elo_data, runs = 5, verbose = FALSE, ...)
elo_data |
A data frame or tibble containing |
runs |
Integer number of randomizations to use in
|
verbose |
Logical. If |
... |
Additional arguments passed to
|
The input elo_data must contain two columns:
winner: ID of the winning sample in each pairwise trial
loser: ID of the losing sample in each trial
These can be created from standard pairwise comparison output using
build_elo_data.
Internally, this function calls:
elochoice — to estimate Elo ratings using
repeated randomization of trial order;
reliability — to compute unweighted and
weighted reliability indices as described in Clark et al. (2018).
If the EloChoice package is not installed, a helpful error message is shown telling the user how to install it.
The returned object mirrors the structure of fit_bt_model
for consistency across scoring engines:
engine — always "EloChoice".
fit — the raw "elochoice" object returned by
EloChoice::elochoice().
elo — a tibble with columns:
ID: sample identifier
elo: estimated Elo rating
(Unlike Bradley–Terry models, EloChoice does not provide standard errors for these ratings, so none are returned.)
reliability — the mean unweighted reliability index
(mean proportion of “upsets” across randomizations).
reliability_weighted — the mean weighted reliability index
(weighted version of the upset measure).
A named list with components:
Character scalar identifying the scoring engine
("EloChoice").
The "elochoice" model object.
A tibble with columns ID and elo.
Numeric scalar: mean unweighted reliability index.
Numeric scalar: mean weighted reliability index.
Clark AP, Howard KL, Woods AT, Penton-Voak IS, Neumann C (2018). "Why rate when you could compare? Using the 'EloChoice' package to assess pairwise comparisons of perceived physical strength." PLOS ONE, 13(1), e0190393. doi:10.1371/journal.pone.0190393.
data("example_writing_pairs", package = "pairwiseLLM")

elo_data <- build_elo_data(example_writing_pairs)

fit <- fit_elo_model(elo_data, runs = 5, verbose = FALSE)
fit$elo
fit$reliability
fit$reliability_weighted
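To show why trial order is randomized, here is a small base-R sketch of sequential Elo updating averaged over several random orderings. It uses the textbook logistic Elo formula with an arbitrary `k = 32`, which is an assumption made for illustration; EloChoice uses its own update rule and defaults, so the numbers will not match fit_elo_model().

```r
# Hypothetical helper: one pass of textbook Elo over winner/loser trials
elo_once <- function(winner, loser, ids, k = 32, start = 0) {
  ratings <- setNames(rep(start, length(ids)), ids)
  for (t in seq_along(winner)) {
    w <- winner[t]
    l <- loser[t]
    # Expected score of the winner before the trial
    expected_w <- 1 / (1 + 10^((ratings[[l]] - ratings[[w]]) / 400))
    delta <- k * (1 - expected_w)
    ratings[[w]] <- ratings[[w]] + delta
    ratings[[l]] <- ratings[[l]] - delta
  }
  ratings
}

set.seed(42)
winner <- c("A", "A", "B", "A", "B", "C")
loser <- c("B", "C", "C", "B", "C", "A")
ids <- c("A", "B", "C")

# Ratings depend on trial order, so average over random orderings
runs <- replicate(5, {
  ord <- sample(seq_along(winner))
  elo_once(winner[ord], loser[ord], ids)
})
mean_elo <- rowMeans(runs)
mean_elo
```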
This function sends a single pairwise comparison prompt to the Google Gemini Generative Language API (Gemini 3 Pro / Flash) and parses the result into a one-row tibble that mirrors the structure used for OpenAI / Anthropic live calls.
gemini_compare_pair_live(
  ID1,
  text1,
  ID2,
  text2,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  api_key = NULL,
  thinking_level = "low",
  temperature = NULL,
  top_p = NULL,
  top_k = NULL,
  max_output_tokens = NULL,
  api_version = "v1beta",
  include_raw = FALSE,
  include_thoughts = FALSE,
  pair_uid = NULL,
  ...
)
ID1 |
Character ID for the first sample. |
text1 |
Character containing the first sample text. |
ID2 |
Character ID for the second sample. |
text2 |
Character containing the second sample text. |
model |
Gemini model identifier (for example |
trait_name |
Short label for the trait (e.g. |
trait_description |
Full-text trait / rubric description. |
prompt_template |
Prompt template string, typically from
|
api_key |
Optional Gemini API key (defaults to
|
thinking_level |
One of
|
temperature |
Optional numeric temperature. If |
top_p |
Optional nucleus sampling parameter. If |
top_k |
Optional top-k sampling parameter. If |
max_output_tokens |
Optional maximum output token count. If |
api_version |
API version to use, default |
include_raw |
Logical; if |
include_thoughts |
Logical; if |
pair_uid |
Optional stable per-pair identifier; when supplied, this
value is used verbatim as |
... |
Reserved for future extensions. Any |
It expects the prompt template to instruct the model to choose exactly one of SAMPLE_1 or SAMPLE_2 and wrap the decision in <BETTER_SAMPLE> tags, for example:
<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>
or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
If include_thoughts = TRUE, the function additionally requests Gemini's
explicit chain-of-thought style reasoning ("thoughts") via the
thinkingConfig block and stores it in a separate thoughts column, while
still using the final answer content to detect the <BETTER_SAMPLE> tag.
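Detecting the <BETTER_SAMPLE> tag in the answer text and mapping it back to an ID can be sketched with a base-R regular expression. The helper `extract_better_sample()` is hypothetical, and the pattern (which tolerates whitespace inside the tag) is an assumption; the package's parser may be stricter or more lenient.

```r
# Hypothetical helper: return "SAMPLE_1", "SAMPLE_2", or NA per response
extract_better_sample <- function(content) {
  pattern <- "<BETTER_SAMPLE>\\s*(SAMPLE_[12])\\s*</BETTER_SAMPLE>"
  matches <- regmatches(content, regexec(pattern, content))
  vapply(
    matches,
    function(m) if (length(m) >= 2L) m[2] else NA_character_,
    character(1)
  )
}

content <- c(
  "Sample 2 is clearer. <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>",
  "No decision tag in this response."
)
better_sample <- extract_better_sample(content)

# Map the position label back to the supplied IDs
ID1 <- "S01"
ID2 <- "S02"
better_id <- ifelse(
  is.na(better_sample), NA_character_,
  ifelse(better_sample == "SAMPLE_1", ID1, ID2)
)
better_id
```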
A tibble with one row and columns:
custom_id - stable ID for the pair (pair_uid if supplied).
ID1, ID2 - provided sample IDs.
model - model name returned by the API (or the requested model).
object_type - "generateContent" on success, otherwise NA.
status_code - HTTP status code (200 on success).
error_message - error message for failures, otherwise NA.
thoughts - explicit chain-of-thought style reasoning text if
include_thoughts = TRUE and the model returns it; otherwise NA.
content - concatenated text of the assistant's final answer (used to
locate the <BETTER_SAMPLE> tag).
better_sample - "SAMPLE_1", "SAMPLE_2", or NA.
better_id - ID1 if SAMPLE_1 is chosen,
ID2 if SAMPLE_2, or NA.
prompt_tokens, completion_tokens, total_tokens - usage counts if
reported by the API, otherwise NA_real_.
# Requires:
# - GEMINI_API_KEY set in your environment
# - Internet access
# - Billable Gemini API usage
## Not run: 
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Gemini 3 Pro example (existing behavior)
res <- gemini_compare_pair_live(
  ID1 = "S01",
  text1 = "Text 1",
  ID2 = "S02",
  text2 = "Text 2",
  model = "gemini-3-pro-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  thinking_level = "low",
  include_thoughts = FALSE,
  include_raw = FALSE
)

res
res$better_id

# Gemini 3 Flash example (minimal thinking)
res_flash <- gemini_compare_pair_live(
  ID1 = "S01",
  text1 = "Text 1",
  ID2 = "S02",
  text2 = "Text 2",
  model = "gemini-3-flash-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  thinking_level = "minimal",
  include_thoughts = FALSE,
  include_raw = FALSE
)
res_flash

## End(Not run)
This is a thin wrapper around the REST endpoint
/v1beta/models/<MODEL>:batchGenerateContent. It accepts a list of
GenerateContent request objects and returns the created Batch job.
gemini_create_batch(
  requests,
  model,
  api_key = Sys.getenv("GEMINI_API_KEY"),
  api_version = "v1beta",
  display_name = NULL
)
requests |
List of GenerateContent request objects, each of the form
|
model |
Gemini model name, for example |
api_key |
Optional Gemini API key. Defaults to
|
api_version |
API version string for the path; defaults to
|
display_name |
Optional display name for the batch. |
Typically you will not call this directly; instead, use
run_gemini_batch_pipeline which builds requests from a tibble
of pairs, creates the batch, polls for completion, and parses the results.
A list representing the Batch job object returned by Gemini.
Important fields include name, metadata$state,
and (after completion) response$inlinedResponses or
response$responsesFile.
# --- Offline preparation: build GenerateContent requests ---

data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 2, seed = 123)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

batch_tbl <- build_gemini_batch_requests(
  pairs = pairs,
  model = "gemini-3-pro-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  thinking_level = "low"
)

# Extract the list of request objects
requests <- batch_tbl$request

# Inspect a single GenerateContent request (purely local)
requests[[1]]

# --- Online step: create the Gemini Batch job ---
# Requires network access and a valid Gemini API key.
## Not run: 
batch <- gemini_create_batch(
  requests = requests,
  model = "gemini-3-pro-preview"
)

batch$name
batch$metadata$state

## End(Not run)
For inline batch requests, Gemini returns results under
response$inlinedResponses$inlinedResponses. In the v1beta REST API
this often comes back as a data frame with one row per request and a
"response" column, where each "response" is itself a data frame
of GenerateContentResponse objects.
gemini_download_batch_results(
  batch,
  requests_tbl,
  output_path,
  api_key = Sys.getenv("GEMINI_API_KEY"),
  api_version = "v1beta"
)
batch |
Either a parsed batch object (as returned by
|
requests_tbl |
Tibble/data frame with a |
output_path |
Path to the JSONL file to create. |
api_key |
Optional Gemini API key (used only when
|
api_version |
API version (default |
This helper writes those results to a local .jsonl file where each
line is a JSON object of the form:
{"custom_id": "<GEM_ID1_vs_ID2>",
"result": {
"type": "succeeded",
"response": { ... GenerateContentResponse ... }
}}
or, when an error occurred:
{"custom_id": "<GEM_ID1_vs_ID2>",
"result": {
"type": "errored",
"error": { ... }
}}
Invisibly returns output_path.
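The JSONL layout described above can be inspected with base R alone. The following is a minimal sketch, assuming flat, well-formed lines of the documented shape; `extract_field()` is a hypothetical helper written for this illustration (it is not part of pairwiseLLM), and the two sample lines are placeholders rather than real API output.

```r
# Two example lines in the documented shape; the payloads are placeholders.
lines <- c(
  '{"custom_id": "GEM_S01_vs_S02", "result": {"type": "succeeded", "response": {}}}',
  '{"custom_id": "GEM_S03_vs_S04", "result": {"type": "errored", "error": {}}}'
)

out_file <- tempfile(fileext = ".jsonl")
writeLines(lines, out_file)

# Hypothetical helper: return the first string value stored under `field`.
# A regex is only adequate for simple lines like the ones above; a JSON
# parser is the safer choice for real results.
extract_field <- function(x, field) {
  pattern <- sprintf('"%s"[[:space:]]*:[[:space:]]*"([^"]*)"', field)
  hits <- regmatches(x, regexec(pattern, x))
  vapply(
    hits,
    function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
    character(1)
  )
}

jsonl <- readLines(out_file, warn = FALSE)
summary_tbl <- data.frame(
  custom_id = extract_field(jsonl, "custom_id"),
  type = extract_field(jsonl, "type"),
  stringsAsFactors = FALSE
)

# Count succeeded versus errored requests
table(summary_tbl$type)
```

A tally like this is a quick way to check whether any requests need to be resubmitted before the results are parsed further.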
# This example requires a Gemini API key and network access.
# It assumes you have already created and run a Gemini batch job.
## Not run:
# Name of an existing Gemini batch
batch_name <- "batches/123456"

# Requests table used to create the batch (must include custom_id)
requests_tbl <- tibble::tibble(
  custom_id = c("GEM_S01_vs_S02", "GEM_S03_vs_S04")
)

# Download inline batch results to a local JSONL file
out_file <- tempfile(fileext = ".jsonl")
gemini_download_batch_results(
  batch = batch_name,
  requests_tbl = requests_tbl,
  output_path = out_file
)

# Inspect the downloaded JSONL
readLines(out_file, warn = FALSE)
## End(Not run)
This retrieves the latest state of a Batch job using its name as
returned by gemini_create_batch.
gemini_get_batch(
  batch_name,
  api_key = Sys.getenv("GEMINI_API_KEY"),
  api_version = "v1beta"
)
batch_name |
Character scalar giving the batch name. |
api_key |
Optional Gemini API key. Defaults to
|
api_version |
API version string for the path; defaults to
|
It corresponds to a GET request on /v1beta/<BATCH_NAME>, where
BATCH_NAME is a string such as "batches/123456".
A list representing the Batch job object.
# Offline: basic batch name validation / object you would pass
batch_name <- "batches/123456"

# Online: retrieve the batch state from Gemini (requires API key + network)
## Not run:
batch <- gemini_get_batch(batch_name = batch_name)
batch$name
batch$metadata$state
## End(Not run)
This helper repeatedly calls gemini_get_batch until the
batch's metadata$state enters a terminal state or a time limit is
reached. For the REST API, states have the form "BATCH_STATE_*".
gemini_poll_batch_until_complete(
  batch_name,
  interval_seconds = 60,
  timeout_seconds = 86400,
  api_key = Sys.getenv("GEMINI_API_KEY"),
  api_version = "v1beta",
  verbose = TRUE
)
batch_name |
Character scalar giving the batch name. |
interval_seconds |
Polling interval in seconds. Defaults to 60. |
timeout_seconds |
Maximum total waiting time in seconds. Defaults to 24 hours (86400 seconds). |
api_key |
Optional Gemini API key. Defaults to
|
api_version |
API version string for the path; defaults to
|
verbose |
Logical; if |
The final Batch job object as returned by
gemini_get_batch.
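The polling behaviour described for this helper (repeat a status call until a terminal state or a time limit is reached) can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: `poll_until_terminal()` and `fake_state()` are hypothetical, and the listed terminal states are an assumed set that should be checked against the Gemini API documentation.

```r
# Assumed terminal states; the authoritative set is defined by the Gemini API.
terminal_states <- c(
  "BATCH_STATE_SUCCEEDED", "BATCH_STATE_FAILED",
  "BATCH_STATE_CANCELLED", "BATCH_STATE_EXPIRED"
)

# Generic polling loop. `get_state` stands in for a call such as
# gemini_get_batch(batch_name)$metadata$state.
poll_until_terminal <- function(get_state,
                                interval_seconds = 60,
                                timeout_seconds = 86400) {
  started <- Sys.time()
  repeat {
    state <- get_state()
    if (state %in% terminal_states) {
      return(list(state = state, timed_out = FALSE))
    }
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    if (elapsed >= timeout_seconds) {
      return(list(state = state, timed_out = TRUE))
    }
    Sys.sleep(interval_seconds)
  }
}

# Fake status source that succeeds on the third call (no network needed).
calls <- 0
fake_state <- function() {
  calls <<- calls + 1
  if (calls < 3) "BATCH_STATE_RUNNING" else "BATCH_STATE_SUCCEEDED"
}

final <- poll_until_terminal(
  fake_state,
  interval_seconds = 0,
  timeout_seconds = 5
)
final$state
```

Returning a `timed_out` flag alongside the last observed state lets the caller distinguish a finished job from one that is still running when the time limit expires.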
# Offline: polling parameters and batch name are plain R objects
batch_name <- "batches/123456"

# Online: poll until the batch reaches a terminal state (requires network)
## Not run:
final_batch <- gemini_poll_batch_until_complete(
  batch_name = batch_name,
  interval_seconds = 10,
  timeout_seconds = 600,
  verbose = TRUE
)
final_batch$metadata$state
## End(Not run)
This function retrieves a prompt template from either:
the user registry (see register_prompt_template), or
a built-in template stored under inst/templates.
get_prompt_template(name = "default")
name |
Character scalar giving the template name. |
The function first checks user-registered templates, then looks for
a built-in text file inst/templates/<name>.txt. The special
name "default" falls back to set_prompt_template()
when no user-registered or built-in template is found.
A single character string containing the prompt template.
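The lookup order described above (user registry first, then a built-in file, then a fallback for "default") can be sketched as follows. This is a minimal illustration, assuming an in-memory environment as the registry and a temporary directory in place of inst/templates; `lookup_template()` is hypothetical, and raising an error for unknown names is an assumption rather than documented behaviour.

```r
# Hypothetical in-memory registry standing in for user-registered templates.
registry <- new.env(parent = emptyenv())
assign("custom", "User-registered template text.", envir = registry)

# Temporary directory standing in for inst/templates.
builtin_dir <- tempfile("templates_")
dir.create(builtin_dir)
writeLines("Built-in minimal template.", file.path(builtin_dir, "minimal.txt"))

lookup_template <- function(name, fallback = "Fallback default template.") {
  # 1. User-registered templates take precedence.
  if (exists(name, envir = registry, inherits = FALSE)) {
    return(get(name, envir = registry, inherits = FALSE))
  }
  # 2. Then a built-in text file named <name>.txt.
  path <- file.path(builtin_dir, paste0(name, ".txt"))
  if (file.exists(path)) {
    return(paste(readLines(path, warn = FALSE), collapse = "\n"))
  }
  # 3. Only "default" has a final fallback (assumed: other names error).
  if (identical(name, "default")) {
    return(fallback)
  }
  stop("No template named '", name, "' was found.")
}

lookup_template("custom")
lookup_template("minimal")
lookup_template("default")
```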
register_prompt_template,
list_prompt_templates,
remove_prompt_template
# Get the built-in default template
tmpl_default <- get_prompt_template("default")

# List available template names
list_prompt_templates()
This function lists template names that are available either as
built-in text files under inst/templates or as
user-registered templates in the current R session.
list_prompt_templates(include_builtin = TRUE, include_registered = TRUE)
include_builtin |
Logical; include built-in template names
(the default is |
include_registered |
Logical; include user-registered names
(the default is |
Built-in templates are identified by files named
<name>.txt within inst/templates. For example, a
file inst/templates/minimal.txt will be listed as
"minimal".
A sorted character vector of unique template names.
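How file names map to template names, and how the two sources merge into one sorted, de-duplicated vector, can be sketched in base R. The directory and the registered names below are hypothetical stand-ins, not the package's actual contents.

```r
# Temporary directory standing in for inst/templates.
builtin_dir <- tempfile("templates_")
dir.create(builtin_dir)
file.create(file.path(builtin_dir, c("default.txt", "minimal.txt")))

# Hypothetical user-registered names; "minimal" deliberately overlaps.
registered <- c("minimal", "my_rubric")

# A file <name>.txt is listed under the name <name>.
builtin <- tools::file_path_sans_ext(
  list.files(builtin_dir, pattern = "\\.txt$")
)

template_names <- sort(unique(c(builtin, registered)))
template_names
```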
list_prompt_templates()
llm_compare_pair() is a thin wrapper around backend-specific comparison
functions. It currently supports the "openai", "anthropic", "gemini",
"together", and "ollama" backends and forwards the call to the
appropriate live comparison helper:
"openai" → openai_compare_pair_live()
"anthropic" → anthropic_compare_pair_live()
"gemini" → gemini_compare_pair_live()
"together" → together_compare_pair_live()
"ollama" → ollama_compare_pair_live()
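The mapping above amounts to a lookup from backend name to helper function. A minimal sketch of that dispatch pattern, using stub functions in place of the real helpers (the stubs and `compare_pair_sketch()` are hypothetical and only report which helper would be called):

```r
# Stub factory: each stub reports which live helper it stands in for.
stub <- function(label) {
  force(label)
  function(...) paste("handled by", label)
}

helpers <- list(
  openai = stub("openai_compare_pair_live()"),
  anthropic = stub("anthropic_compare_pair_live()"),
  gemini = stub("gemini_compare_pair_live()"),
  together = stub("together_compare_pair_live()"),
  ollama = stub("ollama_compare_pair_live()")
)

compare_pair_sketch <- function(backend = c("openai", "anthropic", "gemini",
                                            "together", "ollama"),
                                ...) {
  # match.arg() rejects unknown backends and defaults to the first choice.
  backend <- match.arg(backend)
  helpers[[backend]](...)
}

compare_pair_sketch("gemini")
```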
llm_compare_pair(
  ID1,
  text1,
  ID2,
  text2,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  backend = c("openai", "anthropic", "gemini", "together", "ollama"),
  endpoint = c("chat.completions", "responses"),
  api_key = NULL,
  include_raw = FALSE,
  ...
)
ID1 |
Character ID for the first sample. |
text1 |
Character string containing the first sample's text. |
ID2 |
Character ID for the second sample. |
text2 |
Character string containing the second sample's text. |
model |
Model identifier for the chosen backend. For |
trait_name |
Short label for the trait (for example
|
trait_description |
Full-text definition of the trait. |
prompt_template |
Prompt template string, typically from
|
backend |
Character scalar indicating which LLM provider to use.
One of |
endpoint |
Character scalar specifying which endpoint family to use
for backends that support multiple live APIs. For the |
api_key |
Optional API key for the selected backend. If |
include_raw |
Logical; if |
... |
Additional backend-specific parameters. For |
All backends are expected to return a tibble with a compatible structure, including:
custom_id, ID1, ID2
model, object_type, status_code,
error_message
thoughts (reasoning / thinking text when available)
content (visible assistant output)
better_sample, better_id
prompt_tokens, completion_tokens, total_tokens
For the "openai" backend, the endpoint argument controls whether
the Chat Completions API ("chat.completions") or the Responses API
("responses") is used. For the "anthropic", "gemini", and
"ollama" backends, endpoint is currently ignored and the default
live API for that provider is used.
A tibble with one row and the same columns as the underlying
backend-specific live helper (for example openai_compare_pair_live()
for "openai"). All backends are intended to return a compatible
structure including thoughts, content, and token counts.
openai_compare_pair_live(), anthropic_compare_pair_live(),
gemini_compare_pair_live(), together_compare_pair_live(), and
ollama_compare_pair_live() for backend-specific implementations.
submit_llm_pairs() for row-wise comparisons over a tibble of pairs.
build_bt_data() and fit_bt_model() for Bradley–Terry modelling of
comparison results.
## Not run:
# Requires an API key for the chosen cloud backend. For OpenAI, set
# OPENAI_API_KEY in your environment. Running these examples will incur
# API usage costs.
#
# For local Ollama use, an Ollama server must be running and the models
# must be pulled in advance. No API key is required for the `"ollama"`
# backend.

data("example_writing_samples", package = "pairwiseLLM")
samples <- example_writing_samples[1:2, ]
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Single live comparison using the OpenAI backend and chat.completions
res_live <- llm_compare_pair(
  ID1 = samples$ID[1],
  text1 = samples$text[1],
  ID2 = samples$ID[2],
  text2 = samples$text[2],
  model = "gpt-4.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  backend = "openai",
  endpoint = "chat.completions",
  temperature = 0
)
res_live$better_id

# Using the OpenAI responses endpoint with gpt-5.1 and reasoning = "low"
res_live_gpt5 <- llm_compare_pair(
  ID1 = samples$ID[1],
  text1 = samples$text[1],
  ID2 = samples$ID[2],
  text2 = samples$text[2],
  model = "gpt-5.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  backend = "openai",
  endpoint = "responses",
  reasoning = "low",
  include_thoughts = TRUE,
  temperature = NULL,
  top_p = NULL,
  logprobs = NULL,
  include_raw = TRUE
)
str(res_live_gpt5$raw_response[[1]], max.level = 2)

# Example: single live comparison using a local Ollama backend
res_ollama <- llm_compare_pair(
  ID1 = samples$ID[1],
  text1 = samples$text[1],
  ID2 = samples$ID[2],
  text2 = samples$text[2],
  model = "mistral-small3.2:24b",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  backend = "ollama",
  host = getOption(
    "pairwiseLLM.ollama_host",
    "http://127.0.0.1:11434"
  ),
  think = FALSE
)
res_ollama$better_id
## End(Not run)
Helper to extract the parsed results tibble from a batch object returned by
llm_submit_pairs_batch(). This is a thin wrapper around the results
element returned by backend-specific batch pipelines and is designed to be
forward-compatible with future, more asynchronous batch workflows.
llm_download_batch_results(x, ...)
x |
An object returned by |
... |
Reserved for future use; currently ignored. |
A tibble containing batch comparison results in the standard pairwiseLLM schema.
## Not run:
# Requires running a provider batch job first (API key + internet + cost).
batch <- llm_submit_pairs_batch(
  pairs = tibble::tibble(
    ID1 = "S01",
    text1 = "Text 1",
    ID2 = "S02",
    text2 = "Text 2"
  ),
  backend = "openai",
  model = "gpt-4.1",
  trait_name = trait_description("overall_quality")$name,
  trait_description = trait_description("overall_quality")$description,
  prompt_template = set_prompt_template()
)

res <- llm_download_batch_results(batch)
res
## End(Not run)
This function takes the output of llm_submit_pairs_multi_batch() (or a
previously written registry CSV) and polls each batch until completion,
downloading and parsing results as they finish. It implements a
conservative polling loop with a configurable interval between rounds and
a small delay between individual jobs to reduce the risk of API rate‑limit
errors. The httr2 retry wrapper is still invoked for each API call, so
transient HTTP errors will be retried with exponential back‑off.
llm_resume_multi_batches(
  jobs = NULL,
  output_dir = NULL,
  interval_seconds = 60,
  per_job_delay = 2,
  write_results_csv = FALSE,
  keep_jsonl = TRUE,
  write_registry = FALSE,
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>",
  verbose = FALSE,
  write_combined_csv = FALSE,
  combined_csv_path = NULL,
  openai_max_retries = 3
)
jobs |
A list of job objects returned by
|
output_dir |
Directory containing the batch files and (optionally) the
registry CSV. If |
interval_seconds |
Number of seconds to wait between rounds of polling
unfinished batches. The default ( |
per_job_delay |
Number of seconds to wait between polling individual jobs within a single round. A small delay (e.g. 2) can help prevent 429 (Too Many Requests) responses. |
write_results_csv |
Logical; if |
keep_jsonl |
Logical; if |
write_registry |
Logical; if |
tag_prefix, tag_suffix
|
Character strings passed to
|
verbose |
Logical; if |
write_combined_csv |
Logical; if |
combined_csv_path |
Optional file path for the combined results CSV.
If |
openai_max_retries |
Integer giving the maximum number of times to
retry certain OpenAI API calls when a transient HTTP 5xx error occurs.
In particular, when downloading batch output with
|
A list with four elements: jobs, the updated jobs list with each
element containing parsed results and a done flag; combined, a tibble
obtained by binding all completed results (NULL if no batches
completed); failed_attempts, a tibble of failed attempts captured
during normalization; and batch_failures, a tibble describing batches
that reached a terminal non-success status. If write_results_csv is
TRUE, each batch's parsed results are also written to a CSV file; the
combined tibble is still returned in memory. If write_combined_csv is
TRUE, the combined tibble is additionally written to a CSV file on disk
(see combined_csv_path for details).
# Continuing the example from llm_submit_pairs_multi_batch():
# After submitting multiple batches, resume polling and combine the results.
## Not run:
# Suppose `outdir` is the directory where batch files were written and
# `jobs` is the list of job metadata returned by llm_submit_pairs_multi_batch().

results <- llm_resume_multi_batches(
  jobs = jobs,
  output_dir = outdir,
  interval_seconds = 60,
  per_job_delay = 2,
  write_results_csv = TRUE,
  keep_jsonl = FALSE,
  write_registry = TRUE,
  verbose = TRUE,
  write_combined_csv = TRUE
)

# The combined results are available in the `combined` element
print(results$combined)
## End(Not run)
llm_submit_pairs_batch() is a backend-agnostic front-end for running
provider batch pipelines (OpenAI, Anthropic, Gemini). Together.ai and Ollama
are supported only for live comparisons.
It mirrors submit_llm_pairs() but uses the provider batch APIs under the
hood via run_openai_batch_pipeline(), run_anthropic_batch_pipeline(),
and run_gemini_batch_pipeline().
For OpenAI, this helper will by default:
Use the chat.completions batch style for most models, and
Automatically switch to the responses style endpoint when:
model is in the GPT-5 series (including gpt-5, gpt-5-mini, and
date-stamped gpt-5.1/5.2 variants), and
either include_thoughts = TRUE or a reasoning effort is supplied
in ... (for GPT-5, reasoning = "none" maps to "minimal").
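The endpoint-switching rule above reduces to a two-part condition. A minimal sketch, assuming GPT-5 series models can be recognised by a "gpt-5" name prefix; `choose_openai_endpoint()` is a hypothetical helper written for this illustration and may differ from the package's internal logic.

```r
choose_openai_endpoint <- function(model,
                                   include_thoughts = FALSE,
                                   reasoning = NULL) {
  # Assumption: GPT-5 series models share the "gpt-5" prefix
  # (e.g. "gpt-5", "gpt-5-mini", "gpt-5.1").
  is_gpt5 <- grepl("^gpt-5", model)

  # Thoughts requested, or any reasoning effort supplied.
  wants_reasoning <- isTRUE(include_thoughts) || !is.null(reasoning)

  if (is_gpt5 && wants_reasoning) "responses" else "chat.completions"
}

choose_openai_endpoint("gpt-4.1", include_thoughts = TRUE)
choose_openai_endpoint("gpt-5-mini")
choose_openai_endpoint("gpt-5.1", reasoning = "low")
```

Both conditions must hold: a GPT-5 model without thoughts or a reasoning setting stays on chat.completions, as does a non-GPT-5 model with `include_thoughts = TRUE`.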
Temperature Defaults:
For OpenAI, if temperature is not specified in ...:
It defaults to 0 (deterministic) for standard models or when reasoning is
disabled (reasoning = "none") on supported GPT-5.1/5.2 models.
It remains NULL (API default) when reasoning is enabled, or for GPT-5
minimal reasoning (which ignores temperature).
For Anthropic, standard and date-stamped model names
(e.g. "claude-sonnet-4-5-20250929") are supported. This helper delegates
temperature and extended-thinking behaviour to
run_anthropic_batch_pipeline() and build_anthropic_batch_requests(),
which apply the following rules:
When reasoning = "none" (no extended thinking), the default
temperature is 0 (deterministic) unless you explicitly supply a
different temperature in ....
When reasoning = "enabled" (extended thinking), Anthropic requires
temperature = 1. If you supply a different value in ..., an error
is raised. Default values in this mode are max_tokens = 2048 and
thinking_budget_tokens = 1024, subject to
1024 <= thinking_budget_tokens < max_tokens.
Setting include_thoughts = TRUE while leaving reasoning = "none"
causes run_anthropic_batch_pipeline() to upgrade to
reasoning = "enabled", which implies temperature = 1 for the batch.
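The Anthropic rules above can be read as a small validation routine. The sketch below restates them in base R under the documented defaults; `resolve_anthropic_settings()` is hypothetical and is not the code used by run_anthropic_batch_pipeline().

```r
resolve_anthropic_settings <- function(reasoning = c("none", "enabled"),
                                       include_thoughts = FALSE,
                                       temperature = NULL,
                                       max_tokens = 2048,
                                       thinking_budget_tokens = 1024) {
  reasoning <- match.arg(reasoning)

  # Requesting thoughts upgrades "none" to "enabled".
  if (isTRUE(include_thoughts) && reasoning == "none") {
    reasoning <- "enabled"
  }

  if (reasoning == "none") {
    # No extended thinking: deterministic unless the caller overrides it.
    if (is.null(temperature)) temperature <- 0
    return(list(reasoning = reasoning, temperature = temperature))
  }

  # Extended thinking: temperature must be 1.
  if (is.null(temperature)) temperature <- 1
  if (temperature != 1) {
    stop("Extended thinking requires temperature = 1.")
  }
  # Budget constraint: 1024 <= thinking_budget_tokens < max_tokens.
  if (thinking_budget_tokens < 1024 || thinking_budget_tokens >= max_tokens) {
    stop("Need 1024 <= thinking_budget_tokens < max_tokens.")
  }

  list(
    reasoning = reasoning,
    temperature = temperature,
    max_tokens = max_tokens,
    thinking_budget_tokens = thinking_budget_tokens
  )
}

resolve_anthropic_settings()
resolve_anthropic_settings(include_thoughts = TRUE)
```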
For Gemini, this helper simply forwards include_thoughts and other
arguments to run_gemini_batch_pipeline(), which is responsible for
interpreting any thinking-related options.
Currently, this function synchronously runs the full batch pipeline for
each backend (build requests, create batch, poll until complete, download
results, parse). The returned object contains both metadata and a normalized
results tibble. See llm_download_batch_results() to extract the results.
llm_submit_pairs_batch(
  pairs,
  backend = c("openai", "anthropic", "gemini"),
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  include_thoughts = FALSE,
  include_raw = FALSE,
  ...
)
pairs |
A data frame or tibble of pairs with columns |
backend |
Character scalar; one of |
model |
Character scalar model name to use for the batch job.
|
trait_name |
A short name for the trait being evaluated (e.g.
|
trait_description |
A human-readable description of the trait. |
prompt_template |
A prompt template created by |
include_thoughts |
Logical; whether to request and parse model "thoughts" (where supported).
|
include_raw |
Logical; whether to include raw provider responses in the result (where supported by backends). |
... |
Additional arguments passed through to the backend-specific
|
A list of class "pairwiseLLM_batch" containing at least:
backend: the backend identifier ("openai", "anthropic", "gemini"),
batch_input_path: path to the JSONL request file (if applicable),
batch_output_path: path to the JSONL output file (if applicable),
batch: provider-specific batch object (e.g., job metadata),
results: a tibble of parsed comparison results in the standard
pairwiseLLM schema.
failed_attempts: a tibble of failed attempts captured during
normalization (empty when no failures are observed).
Additional fields returned by the backend-specific pipeline functions are preserved.
# Requires:
# - Internet access
# - Provider API key set in your environment (OPENAI_API_KEY /
#   ANTHROPIC_API_KEY / GEMINI_API_KEY)
# - Billable API usage
## Not run:
pairs <- tibble::tibble(
  ID1 = c("S01", "S03"),
  text1 = c("Text 1", "Text 3"),
  ID2 = c("S02", "S04"),
  text2 = c("Text 2", "Text 4")
)
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# OpenAI batch
batch_openai <- llm_submit_pairs_batch(
  pairs = pairs,
  backend = "openai",
  model = "gpt-5-mini",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  include_thoughts = FALSE,
  service_tier = "flex"
)
res_openai <- llm_download_batch_results(batch_openai)

# Anthropic batch
batch_anthropic <- llm_submit_pairs_batch(
  pairs = pairs,
  backend = "anthropic",
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  include_thoughts = FALSE
)
res_anthropic <- llm_download_batch_results(batch_anthropic)

# Gemini batch
batch_gemini <- llm_submit_pairs_batch(
  pairs = pairs,
  backend = "gemini",
  model = "gemini-3-pro-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  include_thoughts = TRUE
)
res_gemini <- llm_download_batch_results(batch_gemini)
## End(Not run)
These functions provide higher‑level wrappers around the existing
provider‑specific batch APIs in pairwiseLLM. They allow a large tibble of
pairwise comparisons to be automatically split into multiple batch jobs,
submitted concurrently (without polling), recorded in a registry for safe
resumption, and later polled until completion and merged into a single
results data frame. They do not modify any of the underlying API functions
such as run_openai_batch_pipeline() or run_anthropic_batch_pipeline(),
but orchestrate these calls to support resilient multi‑batch workflows.
llm_submit_pairs_multi_batch(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  backend = c("openai", "anthropic", "gemini"),
  batch_size = NULL,
  n_segments = NULL,
  output_dir = tempfile("llm_multi_batch_"),
  write_registry = FALSE,
  keep_jsonl = TRUE,
  verbose = FALSE,
  ...,
  openai_max_retries = 3
)
pairs |
A tibble of pairs with columns |
model |
Model identifier for the chosen backend. Passed through to
the corresponding |
trait_name, trait_description, prompt_template
|
Parameters forwarded
to |
backend |
One of |
batch_size |
Integer giving the maximum number of pairs per batch.
Exactly one of |
n_segments |
Integer giving the number of segments to create. Exactly
one of |
output_dir |
Directory in which to write all batch files, including the
|
write_registry |
Logical; if |
keep_jsonl |
Logical; if |
verbose |
Logical; if |
... |
Additional arguments passed through to the provider‑specific
|
openai_max_retries |
Integer giving the maximum number of times
to retry the initial OpenAI batch submission when a transient
HTTP 5xx error occurs. When creating a segment on the OpenAI
backend, |
A list with two elements: jobs, a list of per‑batch metadata
(similar to the example in the advanced vignette), and registry,
a tibble summarising all jobs. The registry contains columns
segment_index, provider, model, batch_id, batch_input_path,
batch_output_path, csv_path, pairs_path, done, and results
(initialized to NULL). If write_registry is TRUE, the tibble is also written
to disk as jobs_registry.csv.
llm_submit_pairs_multi_batch()
Splits a tibble of comparison pairs into chunks and submits one batch per
chunk using the appropriate provider pipeline. Each batch is created with
poll = FALSE, so the function returns immediately after the batch jobs
have been created. Metadata for each batch (including the batch_id,
provider type, and input/output file paths) is collected and, optionally,
written to a CSV registry for later resumption.
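The chunking step described above can be sketched with row indices alone. This illustration assumes contiguous chunks of near-equal size; `split_into_segments()` is a hypothetical helper, and the package may partition rows differently.

```r
split_into_segments <- function(n_pairs, batch_size = NULL, n_segments = NULL) {
  # Exactly one of the two sizing arguments must be supplied.
  if (is.null(batch_size) == is.null(n_segments)) {
    stop("Supply exactly one of `batch_size` or `n_segments`.")
  }

  idx <- seq_len(n_pairs)
  if (!is.null(batch_size)) {
    # Fixed maximum chunk size; the last chunk may be smaller.
    groups <- ceiling(idx / batch_size)
  } else {
    # Fixed number of chunks of near-equal size.
    groups <- ceiling(idx * n_segments / n_pairs)
  }

  unname(split(idx, groups))
}

# Ten pairs in five segments: two row indices per segment
lengths(split_into_segments(10, n_segments = 5))

# Ten pairs with at most four per batch: chunks of 4, 4, and 2
lengths(split_into_segments(10, batch_size = 4))
```

Each element of the returned list holds the row indices for one batch, so a chunk of pairs can be taken as `pairs[segment, ]` before submission.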
# Example: split a small set of pairs into five segments, submit
# them to the Gemini backend, and then poll and combine the results.
# Requires a funded API key and internet access.
## Not run:
# Construct ten random pairs from the example writing samples
data("example_writing_samples", package = "pairwiseLLM")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 10, seed = 123)

# Directory to store batch files and results
outdir <- tempfile("multi_batch_example_")

# Submit the pairs in five batches. We write the registry to disk
# and print progress messages as each batch is created.
job_info <- llm_submit_pairs_multi_batch(
  pairs = pairs,
  model = "gemini-3-pro-preview",
  trait_name = "writing_quality",
  trait_description = "Which text shows better writing quality?",
  backend = "gemini",
  n_segments = 5,
  output_dir = outdir,
  write_registry = TRUE,
  verbose = TRUE
)

# Resume polling until all batches complete. The per-batch and
# combined results are written to CSV files, the registry is
# refreshed on disk, and progress messages are printed.
results <- llm_resume_multi_batches(
  jobs = job_info$jobs,
  output_dir = outdir,
  interval_seconds = 60,
  per_job_delay = 2,
  write_results_csv = TRUE,
  keep_jsonl = FALSE,
  write_registry = TRUE,
  verbose = TRUE,
  write_combined_csv = TRUE
)

# Access the combined results tibble
head(results$combined)
## End(Not run)
Load an adaptive session from disk.
load_adaptive_session(session_dir)
session_dir |
Directory containing session artifacts. |
Restores a persisted Adaptive state and revalidates basic invariants such
as schema version, required state fields, and index ranges in
step_log. If per-refit item logs are found on disk, they are loaded
into state$item_log and persistence is marked as enabled. Resume uses
strict schema validation for canonical logs; incompatible saved schemas abort
with explicit errors.
An adaptive_state object ready for resume.
save_adaptive_session(), validate_session_dir(), adaptive_rank_resume()
Other adaptive persistence:
save_adaptive_session(),
validate_session_dir()
dir <- tempfile("pwllm-session-")
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
save_adaptive_session(state, dir, overwrite = TRUE)
restored <- load_adaptive_session(dir)
summarize_adaptive(restored)
Creates a judge function compatible with adaptive_rank_run_live() by
wrapping llm_compare_pair() and converting provider responses into
adaptive binary outcomes (Y in {0,1}).
make_adaptive_judge_llm(
  backend = c("openai", "anthropic", "gemini", "together", "ollama"),
  model,
  trait = "overall_quality",
  trait_name = NULL,
  trait_description = NULL,
  prompt_template = set_prompt_template(),
  endpoint = "chat.completions",
  api_key = NULL,
  include_raw = FALSE,
  text_col = "text",
  judge_args = list()
)
backend |
Backend passed to |
model |
Model identifier passed to |
trait |
Built-in trait key used when no custom trait is supplied.
Ignored when both |
trait_name |
Optional custom trait display name. |
trait_description |
Optional custom trait definition. |
prompt_template |
Prompt template string. Defaults to
|
endpoint |
Endpoint family passed to |
api_key |
Optional API key passed to |
include_raw |
Logical; forwarded to |
text_col |
Name of the text column expected in adaptive item rows. |
judge_args |
Named list of additional fixed arguments forwarded to
|
The returned function has signature judge(A, B, state, ...) and enforces
the adaptive transactional contract:
it returns is_valid = TRUE with Y in {0,1} when the model response
identifies one of the two presented items, and returns is_valid = FALSE
otherwise.
Model configuration is split into:
fixed build-time options via judge_args,
per-run overrides via judge_call_args in adaptive_rank(),
optional per-step overrides via ... passed through
adaptive_rank_run_live().
Collectively this supports all llm_compare_pair() options, including
backend-specific parameters such as OpenAI reasoning and service_tier.
A function judge(A, B, state, ...) returning a list with fields
is_valid, Y, and invalid_reason.
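The return contract described above can be illustrated with a toy judge that needs no API access. This is only a sketch for offline experimentation, not the function built by make_adaptive_judge_llm(): the name toy_length_judge is hypothetical, and it assumes that A and B are one-row data frames with a text column and that Y = 1 means "A preferred". Neither assumption is confirmed by this page.

```r
# Toy judge for offline experimentation; NOT the judge built by
# make_adaptive_judge_llm(). Assumptions (not confirmed by the docs):
#   * A and B are one-row data frames with a character `text` column
#   * Y = 1 means "A preferred", Y = 0 means "B preferred"
toy_length_judge <- function(A, B, state, ...) {
  len_a <- nchar(A$text)
  len_b <- nchar(B$text)
  if (is.na(len_a) || is.na(len_b) || len_a == len_b) {
    # No usable preference: report an invalid outcome instead of guessing.
    return(list(
      is_valid = FALSE,
      Y = NA_integer_,
      invalid_reason = "no_preference"
    ))
  }
  list(
    is_valid = TRUE,
    Y = as.integer(len_a > len_b),
    invalid_reason = NA_character_
  )
}

item_a <- data.frame(text = "A longer writing sample.", stringsAsFactors = FALSE)
item_b <- data.frame(text = "Short.", stringsAsFactors = FALSE)
toy_result <- toy_length_judge(item_a, item_b, state = NULL)
tie_result <- toy_length_judge(item_a, item_a, state = NULL)
```

A judge of this shape always returns the same three fields, so a caller can check is_valid before using Y.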
adaptive_rank(), adaptive_rank_run_live(), llm_compare_pair()
Other adaptive ranking:
adaptive_rank(),
adaptive_rank_resume(),
adaptive_rank_run_live(),
adaptive_rank_start(),
summarize_adaptive()
judge <- make_adaptive_judge_llm(
  backend = "openai",
  model = "gpt-5.1",
  endpoint = "responses",
  judge_args = list(
    reasoning = "low",
    service_tier = "flex",
    include_thoughts = FALSE
  )
)
Given a data frame of samples with columns ID and text,
this function generates all unordered pairs (combinations) of samples.
Each pair appears exactly once, with ID1 < ID2 in
lexicographic order.
make_pairs(samples)
samples |
A tibble or data frame with columns |
A tibble with columns:
ID1, text1
ID2, text2
samples <- tibble::tibble(
  ID = c("S1", "S2", "S3"),
  text = c("Sample 1", "Sample 2", "Sample 3")
)

pairs_all <- make_pairs(samples)
pairs_all

# Using the built-in example data
data("example_writing_samples")
pairs_example <- make_pairs(example_writing_samples)
nrow(pairs_example) # should be choose(10, 2) = 45
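The pairing rule for make_pairs() (every unordered combination exactly once, with ID1 preceding ID2) can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: all_pairs_sketch is a hypothetical name, and it sorts IDs with the radix method (C-locale order) to decide which ID comes first.

```r
# Hypothetical helper: illustrates the pairing rule, not pairwiseLLM's own code.
all_pairs_sketch <- function(samples) {
  stopifnot(all(c("ID", "text") %in% names(samples)))
  ids <- as.character(samples$ID)
  # Sort IDs so the first member of every pair precedes the second.
  sorted_ids <- sort(ids, method = "radix")
  # combn() returns a 2 x choose(n, 2) matrix; each column is one pair.
  idx <- utils::combn(sorted_ids, 2)
  # Lookup table from ID to text.
  txt <- stats::setNames(as.character(samples$text), ids)
  data.frame(
    ID1 = idx[1, ],
    text1 = unname(txt[idx[1, ]]),
    ID2 = idx[2, ],
    text2 = unname(txt[idx[2, ]]),
    stringsAsFactors = FALSE
  )
}

samples_sketch <- data.frame(
  ID = c("S3", "S1", "S2"),
  text = c("Sample 3", "Sample 1", "Sample 2"),
  stringsAsFactors = FALSE
)
pairs_sketch <- all_pairs_sketch(samples_sketch)
```

With n samples the result has choose(n, 2) rows, which is why exhaustive pairing becomes expensive quickly and sampling helpers are useful for larger sets.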
ollama_compare_pair_live() sends a single pairwise comparison prompt to a
local Ollama server and parses the result into the standard pairwiseLLM
tibble format.
ollama_compare_pair_live(
  ID1,
  text1,
  ID2,
  text2,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  host = getOption("pairwiseLLM.ollama_host", "http://127.0.0.1:11434"),
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>",
  think = FALSE,
  num_ctx = 8192L,
  include_raw = FALSE,
  ...
)
ID1 |
Character ID for the first sample. |
text1 |
Character string containing the first sample's text. |
ID2 |
Character ID for the second sample. |
text2 |
Character string containing the second sample's text. |
model |
Ollama model name (for example |
trait_name |
Short label for the trait (for example
|
trait_description |
Full-text definition of the trait. |
prompt_template |
Prompt template string, typically from
|
host |
Base URL of the Ollama server. Defaults to the option
|
tag_prefix |
Prefix for the better-sample tag. Defaults to
|
tag_suffix |
Suffix for the better-sample tag. Defaults to
|
think |
Logical; if |
num_ctx |
Integer; context window to use via |
include_raw |
Logical; if |
... |
Reserved for future extensions. When |
The function targets the /api/generate endpoint on a running Ollama
instance and expects a single non-streaming response. Model names should
match those available in your Ollama installation (for example
"mistral-small3.2:24b", "qwen3:32b", "gemma3:27b").
Temperature and context length are controlled as follows:
By default, temperature = 0 for all models.
For Qwen models (model names beginning with "qwen") called with
think = TRUE, temperature is set to 0.6 instead.
The context window is set via options$num_ctx, which
defaults to 8192L but may be overridden via the num_ctx
argument.
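The defaults listed above can be summarized as a small decision function. This is a sketch of the documented rules only; ollama_options_sketch is a hypothetical name, and the exact matching the package applies to model names (for example, case sensitivity) is an assumption here.

```r
# Hypothetical helper mirroring the documented defaults; not package code.
# Assumption: "beginning with qwen" is a case-sensitive prefix match.
ollama_options_sketch <- function(model, think = FALSE, num_ctx = 8192L) {
  is_qwen <- startsWith(model, "qwen")
  # 0.6 only for Qwen models with thinking enabled; otherwise deterministic.
  temperature <- if (is_qwen && isTRUE(think)) 0.6 else 0
  list(temperature = temperature, num_ctx = as.integer(num_ctx))
}

opts_default <- ollama_options_sketch("mistral-small3.2:24b")
opts_qwen <- ollama_options_sketch("qwen3:32b", think = TRUE)
opts_qwen_plain <- ollama_options_sketch("qwen3:32b", think = FALSE, num_ctx = 4096)
```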
If the Ollama response includes a thinking field (as described in the
Ollama API), that string is stored in the thoughts column of the
returned tibble; otherwise thoughts is NA. This allows
pairwiseLLM to consume Ollama's native thinking output in a way that is
consistent with other backends that expose explicit reasoning traces.
The Ollama backend is intended to be compatible with the existing OpenAI,
Anthropic, and Gemini backends, so the returned tibble can be used
directly with downstream helpers such as build_bt_data() and
fit_bt_model().
In typical workflows, users will call llm_compare_pair() with
backend = "ollama" rather than using
ollama_compare_pair_live() directly. The direct helper is exported
so that advanced users can work with Ollama in a more explicit and
backend-specific way.
The function assumes that:
An Ollama server is running and reachable at host.
The requested model has already been pulled, for example
via ollama pull mistral-small3.2:24b on the command line.
When the Ollama response includes a thinking field (as documented
in the Ollama API), that string is copied into the thoughts column
of the returned tibble; otherwise thoughts is NA. This
parsed thinking output can be logged, inspected, or analyzed alongside
the visible comparison decisions.
A tibble with one row and columns:
custom_id – stable ID for the pair (pair_uid if
supplied via ...; otherwise "LIVE_<ID1>_vs_<ID2>").
ID1, ID2 – the sample IDs supplied to the function.
model – model name reported by the API (or the requested
model).
object_type – backend object type (for example
"ollama.generate").
status_code – HTTP-style status code (200 if
successful).
error_message – error message if something goes wrong;
otherwise NA.
thoughts – reasoning / thinking text when a
thinking field is returned by Ollama; otherwise NA.
content – visible response text from the model (from the
response field).
better_sample – "SAMPLE_1", "SAMPLE_2", or
NA, based on tags found in content.
better_id – ID1 if "SAMPLE_1" is chosen,
ID2 if "SAMPLE_2" is chosen, otherwise NA.
prompt_tokens – prompt / input token count (if reported).
completion_tokens – completion / output token count (if
reported).
total_tokens – total token count (if reported).
raw_response – optional list-column containing the parsed
JSON body (present only when include_raw = TRUE).
submit_ollama_pairs_live() for single-backend, row-wise comparisons.
llm_compare_pair() for backend-agnostic single-pair comparisons.
submit_llm_pairs() for backend-agnostic comparisons over tibbles of
pairs.
## Not run: 
# Requires a running Ollama server and locally available models.

data("example_writing_samples", package = "pairwiseLLM")

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

ID1 <- example_writing_samples$ID[1]
ID2 <- example_writing_samples$ID[2]
text1 <- example_writing_samples$text[1]
text2 <- example_writing_samples$text[2]

# Make sure an Ollama server is running

# mistral example
res_mistral <- ollama_compare_pair_live(
  ID1 = ID1,
  text1 = text1,
  ID2 = ID2,
  text2 = text2,
  model = "mistral-small3.2:24b",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

res_mistral$better_id

# qwen example with reasoning
res_qwen_think <- ollama_compare_pair_live(
  ID1 = ID1,
  text1 = text1,
  ID2 = ID2,
  text2 = text2,
  model = "qwen3:32b",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  think = TRUE,
  include_raw = TRUE
)

res_qwen_think$better_id
res_qwen_think$thoughts

## End(Not run)
This function sends a single pairwise comparison prompt to the OpenAI API
and parses the result into a small tibble. It is the live / on-demand
analogue of build_openai_batch_requests plus
parse_openai_batch_output.
openai_compare_pair_live(
  ID1,
  text1,
  ID2,
  text2,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  endpoint = c("chat.completions", "responses"),
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>",
  api_key = NULL,
  include_raw = FALSE,
  ...
)
ID1 |
Character ID for the first sample. |
text1 |
Character string containing the first sample's text. |
ID2 |
Character ID for the second sample. |
text2 |
Character string containing the second sample's text. |
model |
OpenAI model name (e.g. "gpt-4.1", "gpt-5.2-2025-12-11"). |
trait_name |
Short label for the trait (e.g. "Overall Quality"). |
trait_description |
Full-text definition of the trait. |
prompt_template |
Prompt template string. |
endpoint |
Which OpenAI endpoint to use: |
tag_prefix |
Prefix for the better-sample tag. |
tag_suffix |
Suffix for the better-sample tag. |
api_key |
Optional OpenAI API key. |
include_raw |
Logical; if TRUE, adds a |
... |
Additional OpenAI parameters, for example
|
It supports both the Chat Completions endpoint ("/v1/chat/completions") and the Responses endpoint ("/v1/responses", for example gpt-5.1 with reasoning), using the same prompt template and model / parameter rules as the batch pipeline.
For the Responses endpoint, the function collects:
Reasoning / "thoughts" text (if available) into the thoughts
column.
Visible assistant output into the content column.
Temperature Defaults:
If temperature is not provided in ...:
It defaults to 0 (deterministic) for standard models or when reasoning is
disabled.
It remains NULL when reasoning is enabled, as the API does not support
temperature in that mode.
A tibble with one row and columns:
Stable ID for the pair (pair_uid if supplied via
...; otherwise "LIVE_<ID1>_vs_<ID2>").
The sample IDs you supplied.
Model name reported by the API.
OpenAI object type (for example "chat.completion" or "response").
HTTP-style status code (200 if successful).
Error message if something goes wrong; otherwise NA.
Reasoning / thinking summary text when available, otherwise NA.
Concatenated text from the assistant's visible output. For
the Responses endpoint this is taken from the type = "message"
output items and does not include reasoning summaries.
"SAMPLE_1", "SAMPLE_2", or NA.
ID1 if SAMPLE_1 is chosen, ID2 if SAMPLE_2 is chosen, otherwise NA.
Prompt / input token count (if reported).
Completion / output token count (if reported).
Total token count (if reported).
(Optional) list-column containing the parsed JSON body.
## Not run: 
# Requires API key set and internet access

# 1. Standard comparison using GPT-4.1
res <- openai_compare_pair_live(
  ID1 = "A", text1 = "Text A...",
  ID2 = "B", text2 = "Text B...",
  model = "gpt-4.1",
  trait_name = "clarity",
  trait_description = "Which text is clearer?",
  temperature = 0
)

# 2. Reasoning comparison using GPT-5.2
res_reasoning <- openai_compare_pair_live(
  ID1 = "A", text1 = "Text A...",
  ID2 = "B", text2 = "Text B...",
  model = "gpt-5.2-2025-12-11",
  trait_name = "clarity",
  trait_description = "Which text is clearer?",
  endpoint = "responses",
  include_thoughts = TRUE,
  reasoning = "high",
  service_tier = "flex"
)
print(res_reasoning$thoughts)

## End(Not run)
Creates and executes a batch based on a previously uploaded input file.
openai_create_batch(
  input_file_id,
  endpoint,
  completion_window = "24h",
  metadata = NULL,
  api_key = NULL
)
input_file_id |
The ID of the uploaded file (with purpose |
endpoint |
The endpoint for the batch, e.g. |
completion_window |
Time frame in which the batch should be processed.
Currently only |
metadata |
Optional named list of metadata key–value pairs. |
api_key |
Optional OpenAI API key. |
A list representing the Batch object.
## Not run: 
# Requires OPENAI_API_KEY set in your environment and network access.

file_obj <- openai_upload_batch_file("batch_input.jsonl")

batch_obj <- openai_create_batch(
  input_file_id = file_obj$id,
  endpoint = "/v1/chat/completions"
)

batch_obj$status

## End(Not run)
Given a batch ID, retrieves the batch metadata, extracts the
output_file_id, and downloads the corresponding file content to path.
openai_download_batch_output(batch_id, path, api_key = NULL)
batch_id |
The batch ID (e.g. |
path |
Local file path to write the downloaded |
api_key |
Optional OpenAI API key. |
Invisibly, the path to the downloaded file.
## Not run: 
# Requires OPENAI_API_KEY and a completed batch with an output_file_id.

openai_download_batch_output("batch_abc123", "batch_output.jsonl")

# You can then parse the file
res <- parse_openai_batch_output("batch_output.jsonl")
head(res)

## End(Not run)
Retrieve an OpenAI batch
openai_get_batch(batch_id, api_key = NULL)
batch_id |
The batch ID (e.g. |
api_key |
Optional OpenAI API key. |
A list representing the Batch object.
## Not run: 
# Requires OPENAI_API_KEY and an existing batch ID.

batch <- openai_get_batch("batch_abc123")
batch$status

## End(Not run)
Repeatedly calls openai_get_batch() until the batch reaches a terminal
status (one of "completed", "failed", "cancelled", "expired"),
a timeout is reached, or max_attempts is exceeded.
openai_poll_batch_until_complete(
  batch_id,
  interval_seconds = 5,
  timeout_seconds = 600,
  max_attempts = Inf,
  api_key = NULL,
  verbose = TRUE
)
batch_id |
The batch ID. |
interval_seconds |
Number of seconds to wait between polling attempts. |
timeout_seconds |
Maximum total time to wait in seconds before giving up. |
max_attempts |
Maximum number of polling attempts. This is mainly useful
for testing; default is |
api_key |
Optional OpenAI API key. |
verbose |
Logical; if |
This is a synchronous helper – it will block until one of the conditions above is met.
The final Batch object (a list) as returned by openai_get_batch().
## Not run: 
# Requires OPENAI_API_KEY and a created batch that may still be running.

batch <- openai_create_batch("file_123", endpoint = "/v1/chat/completions")

final <- openai_poll_batch_until_complete(
  batch_id = batch$id,
  interval_seconds = 10,
  timeout_seconds = 3600
)

final$status

## End(Not run)
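The stopping rules for openai_poll_batch_until_complete() (terminal status, timeout, or attempt limit) can be sketched as a generic blocking loop. This is an illustration, not the package's implementation: poll_until_terminal and make_fake_status are hypothetical names, and get_status stands in for a call such as openai_get_batch(batch_id)$status so that the sketch runs without network access.

```r
# Hypothetical generic poller; `get_status` is supplied by the caller and
# stands in for something like openai_get_batch(batch_id)$status.
poll_until_terminal <- function(get_status,
                                interval_seconds = 5,
                                timeout_seconds = 600,
                                max_attempts = Inf) {
  terminal <- c("completed", "failed", "cancelled", "expired")
  started <- Sys.time()
  attempts <- 0
  repeat {
    attempts <- attempts + 1
    status <- get_status()
    # Stop as soon as a terminal status is seen.
    if (status %in% terminal) break
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    # Otherwise give up on timeout or when the attempt budget is used up.
    if (elapsed >= timeout_seconds || attempts >= max_attempts) break
    Sys.sleep(interval_seconds)
  }
  list(status = status, attempts = attempts)
}

# Simulated status source that walks through a fixed sequence.
make_fake_status <- function(statuses) {
  i <- 0
  function() {
    i <<- i + 1
    statuses[min(i, length(statuses))]
  }
}

done <- poll_until_terminal(
  make_fake_status(c("validating", "in_progress", "completed")),
  interval_seconds = 0
)
capped <- poll_until_terminal(
  make_fake_status("in_progress"),
  interval_seconds = 0,
  max_attempts = 2
)
```

Because the loop blocks the R session, a finite timeout_seconds or max_attempts keeps a stalled batch from hanging a script indefinitely.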
Uploads a .jsonl file to the OpenAI Files API with purpose "batch",
which can then be used to create a Batch job.
openai_upload_batch_file(path, purpose = "batch", api_key = NULL)
path |
Path to the local |
purpose |
File purpose. For the Batch API this should be |
api_key |
Optional OpenAI API key. Defaults to
|
A list representing the File object returned by the API, including
id, filename, bytes, purpose, etc.
## Not run: 
# Requires OPENAI_API_KEY set in your environment and network access

file_obj <- openai_upload_batch_file("batch_input.jsonl")
file_obj$id

## End(Not run)
This function parses a .jsonl file produced by
anthropic_download_batch_results. Each line in the file
is a JSON object with at least:
parse_anthropic_batch_output(
  jsonl_path,
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>"
)
jsonl_path |
Path to a |
tag_prefix |
Prefix for the better-sample tag. Defaults to
|
tag_suffix |
Suffix for the better-sample tag. Defaults to
|
{
"custom_id": "ANTH_S01_vs_S02",
"result": {
"type": "succeeded" | "errored" | "canceled" | "expired",
"message": { ... } # when type == "succeeded"
"error": { ... } # when type == "errored" (optional)
}
}
Results may be returned in any order. This function uses the
custom_id field to recover ID1 and ID2 and then
applies the same parsing logic as anthropic_compare_pair_live,
including extraction of extended thinking blocks (when enabled) into
a separate thoughts column.
A tibble with one row per result. The columns mirror
anthropic_compare_pair_live with batch-specific additions:
Batch custom ID (for example "ANTH_S01_vs_S02").
Sample IDs recovered from custom_id.
Model name reported by Anthropic.
Anthropic object type (for example "message").
HTTP-style status code (200 for succeeded results,
NA otherwise).
One of "succeeded", "errored",
"canceled", "expired".
Error message for non-succeeded results, otherwise NA.
Extended thinking text returned by Claude when reasoning
is enabled (for example when reasoning = "enabled"),
otherwise NA.
Concatenated assistant text for succeeded results.
"SAMPLE_1", "SAMPLE_2", or NA.
ID1 if SAMPLE_1 is chosen, ID2 if SAMPLE_2 is chosen, otherwise NA.
Prompt / input token count (if reported).
Completion / output token count (if reported).
Total token count (reported or computed upstream).
## Not run: 
# Requires a completed Anthropic batch file
tbl <- parse_anthropic_batch_output("anthropic-results.jsonl")

## End(Not run)
This reads a JSONL file created by gemini_download_batch_results() and
converts each line into a row that mirrors the structure used for live
Gemini calls, including a thoughts column when the batch was run with
include_thoughts = TRUE.
parse_gemini_batch_output(results_path, requests_tbl)
results_path |
Path to the JSONL file produced by
|
requests_tbl |
Tibble/data frame with at least columns |
A tibble with one row per request and columns:
custom_id, ID1, ID2
model, object_type, status_code, result_type, error_message
thoughts, thought_signature, thoughts_token_count
content, better_sample, better_id
prompt_tokens, completion_tokens, total_tokens
# This example assumes you have already:
# 1. Built Gemini batch requests with `build_gemini_batch_requests()`
# 2. Submitted and completed a batch job via the Gemini API
# 3. Downloaded the results using `gemini_download_batch_results()`

## Not run: 
# Path to a JSONL file created by `gemini_download_batch_results()`
results_path <- "gemini_batch_results.jsonl"

# Requests table used to build the batch (must contain custom_id, ID1, ID2)
# as returned by `build_gemini_batch_requests()`
requests_tbl <- readRDS("gemini_batch_requests.rds")

# Parse batch output into a tidy tibble of pairwise results
results <- parse_gemini_batch_output(
  results_path = results_path,
  requests_tbl = requests_tbl
)

results

## End(Not run)
This function reads an OpenAI Batch API output file (JSONL) and extracts
pairwise comparison results for use with Bradley–Terry models. It supports
both the Chat Completions endpoint (where object = "chat.completion")
and the Responses endpoint (where object = "response"), including
GPT-5.1 with reasoning.
parse_openai_batch_output(
  path,
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>"
)
path |
Path to a JSONL output file downloaded from the OpenAI Batch API. |
tag_prefix |
Character string marking the start of the better-sample
tag. Defaults to |
tag_suffix |
Character string marking the end of the better-sample
tag. Defaults to |
For each line, the function:
extracts custom_id and parses ID1 and ID2
from the pattern "<prefix>ID1_vs_ID2",
pulls the raw LLM content containing the
<BETTER_SAMPLE>...</BETTER_SAMPLE> tag,
determines whether SAMPLE_1 or SAMPLE_2 was
selected and maps that to better_id,
collects model name and token usage statistics (including reasoning tokens for GPT-5.1 Responses),
when using the Responses endpoint with reasoning, separates
reasoning summaries into the thoughts column and visible
assistant output into content.
The returned data frame is suitable as input for
build_bt_data.
A tibble with one row per successfully parsed comparison and columns:
The custom_id from the batch request.
Sample IDs inferred from custom_id.
The model name reported by the API.
The OpenAI response object type
(e.g., "chat.completion" or "response").
HTTP-style status code from the batch output.
Error message, if present; otherwise NA.
Reasoning / thinking summary text when available
(for Responses with reasoning); otherwise NA.
The raw assistant visible content string (the LLM's
output), used to locate the <BETTER_SAMPLE> tag. For
Responses with reasoning this does not include reasoning
summaries, which are kept in thoughts.
Either "SAMPLE_1", "SAMPLE_2",
or NA if the tag was not found.
ID1 if SAMPLE_1 was chosen, ID2
if SAMPLE_2 was chosen, or NA.
Prompt/input token count (if reported).
Completion/output token count (if reported).
Total tokens (if reported).
Cached prompt tokens (if reported via
input_tokens_details$cached_tokens); otherwise NA.
Reasoning tokens (if reported via
output_tokens_details$reasoning_tokens); otherwise
NA.
# Create a temporary JSONL file containing a simulated OpenAI batch result
tf <- tempfile(fileext = ".jsonl")

# A single line of JSON representing a successful Chat Completion
# custom_id implies "LIVE_" prefix, ID1="A", ID2="B"
json_line <- paste0(
  '{"custom_id": "LIVE_A_vs_B", ',
  '"response": {"status_code": 200, "body": {',
  '"object": "chat.completion", ',
  '"model": "gpt-4", ',
  '"choices": [{"message": {"content": "<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>"}}], ',
  '"usage": {"prompt_tokens": 50, "completion_tokens": 10, "total_tokens": 60}}}}'
)

writeLines(json_line, tf)

# Parse the output
res <- parse_openai_batch_output(tf)

# Inspect the result
print(res$better_id)
print(res$prompt_tokens)

# Clean up
unlink(tf)
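Two of the per-line steps described for parse_openai_batch_output(), recovering ID1 and ID2 from custom_id and mapping the tagged choice to better_id, can be sketched in base R. The helper names below are hypothetical and the package's actual parsing may differ; in particular, the sketch assumes the prefix consists of letters followed by a single underscore (as in "LIVE_"), which the source does not state.

```r
# Hypothetical helpers; a sketch of the documented steps, not package code.

# Assumption: custom_id looks like "<LETTERS>_<ID1>_vs_<ID2>".
split_custom_id <- function(custom_id) {
  m <- regmatches(
    custom_id,
    regexec("^[A-Za-z]+_(.+)_vs_(.+)$", custom_id)
  )[[1]]
  if (length(m) != 3) {
    return(c(ID1 = NA_character_, ID2 = NA_character_))
  }
  c(ID1 = m[2], ID2 = m[3])
}

# Return "SAMPLE_1", "SAMPLE_2", or NA depending on the tagged choice.
extract_better_sample <- function(content,
                                  tag_prefix = "<BETTER_SAMPLE>",
                                  tag_suffix = "</BETTER_SAMPLE>") {
  start <- as.integer(regexpr(tag_prefix, content, fixed = TRUE))
  if (start < 0) return(NA_character_)
  rest <- substring(content, start + nchar(tag_prefix))
  end <- as.integer(regexpr(tag_suffix, rest, fixed = TRUE))
  if (end < 0) return(NA_character_)
  value <- trimws(substring(rest, 1, end - 1))
  if (value %in% c("SAMPLE_1", "SAMPLE_2")) value else NA_character_
}

ids <- split_custom_id("LIVE_A_vs_B")
choice <- extract_better_sample("I pick <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>.")
# Map the tagged choice back to the corresponding sample ID.
better_id <- switch(choice, SAMPLE_1 = ids[["ID1"]], SAMPLE_2 = ids[["ID2"]])
no_tag <- extract_better_sample("No tag in this response.")
```

Returning NA when the tag is missing or malformed keeps unparseable responses visible in the results instead of silently assigning a winner.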
S3 method for printing adaptive_state objects.
## S3 method for class 'adaptive_state'
print(x, ...)
x |
An |
... |
Unused. |
x, invisibly.
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
print(state)
Prints a compact, human-readable summary of an object returned by
estimate_llm_pairs_cost. The print method reports the backend,
model, pilot/remaining pair counts, estimated token totals, and both the
expected and budget cost estimates.
## S3 method for class 'pairwiseLLM_cost_estimate'
print(x, ...)
x |
An object of class |
... |
Unused. Included for method compatibility. |
x, invisibly.
## Not run: 
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 50, seed = 123)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

est <- estimate_llm_pairs_cost(
  pairs = pairs,
  backend = "openai",
  model = "gpt-4.1",
  endpoint = "chat.completions",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  mode = "batch",
  batch_discount = 0.5,
  n_test = 10,
  cost_per_million_input = 0.15,
  cost_per_million_output = 0.60
)

est

## End(Not run)
This helper takes a table of paired writing samples (with columns
ID1, text1, ID2, and text2) and, for each row,
randomly decides whether to keep the current order or swap the two samples.
On average, about half of the pairs keep their original order and half
are reversed.
randomize_pair_order(pairs, seed = NULL)
pairs |
A data frame or tibble with columns |
seed |
Optional integer seed for reproducible randomization. If
|
This is useful for reducing position biases in LLM-based paired comparisons,
while still allowing reverse-order consistency checks via
sample_reverse_pairs and
compute_reverse_consistency.
If you want a deterministic alternation of positions (for example,
first pair as-is, second pair swapped, third pair as-is, and so on), use
alternate_pair_order instead of this function.
A tibble with the same columns as pairs, but with some rows'
ID1/text1 and ID2/text2 swapped at random.
alternate_pair_order for deterministic alternating order,
sample_reverse_pairs and
compute_reverse_consistency for reverse-order checks.
data("example_writing_samples", package = "pairwiseLLM")

# Build all pairs
pairs_all <- make_pairs(example_writing_samples)

# Randomly flip the order within pairs
set.seed(123)
pairs_rand <- randomize_pair_order(pairs_all, seed = 123)

head(pairs_all[, c("ID1", "ID2")])
head(pairs_rand[, c("ID1", "ID2")])
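The per-row coin flip behind randomize_pair_order() can be sketched in base R. This is an illustration rather than the package's code: swap_pairs_at_random is a hypothetical name, and drawing the swap decision with runif() is one possible choice among several.

```r
# Hypothetical helper; illustrates the idea rather than the package's code.
swap_pairs_at_random <- function(pairs, seed = NULL) {
  stopifnot(all(c("ID1", "text1", "ID2", "text2") %in% names(pairs)))
  if (!is.null(seed)) set.seed(seed)
  # One independent coin flip per row: TRUE means "reverse this pair".
  swap <- stats::runif(nrow(pairs)) < 0.5
  out <- pairs
  out$ID1[swap] <- pairs$ID2[swap]
  out$text1[swap] <- pairs$text2[swap]
  out$ID2[swap] <- pairs$ID1[swap]
  out$text2[swap] <- pairs$text1[swap]
  out
}

demo_pairs <- data.frame(
  ID1 = c("S1", "S1", "S2"),
  text1 = c("Sample 1", "Sample 1", "Sample 2"),
  ID2 = c("S2", "S3", "S3"),
  text2 = c("Sample 2", "Sample 3", "Sample 3"),
  stringsAsFactors = FALSE
)
rand_a <- swap_pairs_at_random(demo_pairs, seed = 123)
rand_b <- swap_pairs_at_random(demo_pairs, seed = 123)
```

Each row still contains the same two samples after randomization; only their positions may differ, and reusing a seed reproduces the same assignment.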
This function extracts ID and text columns from a data frame and enforces that IDs are unique. By default, it assumes the first column is the ID and the second column is the text.
read_samples_df(df, id_col = 1, text_col = 2)
df |
A data frame or tibble containing at least two columns. |
id_col |
Column specifying the IDs. Can be a column name (string) or a column index (integer). Defaults to 1. |
text_col |
Column specifying the writing samples (character). Can be a column name or index. Defaults to 2. |
A tibble with columns:
ID: character ID for each sample
text: character string of the writing sample
Any remaining columns in df are retained unchanged.
df <- data.frame(
  StudentID = c("S1", "S2"),
  Response = c("This is sample 1.", "This is sample 2."),
  Grade = c(8, 9),
  stringsAsFactors = FALSE
)
samples <- read_samples_df(df, id_col = "StudentID", text_col = "Response")
samples
# Using the built-in example dataset
data("example_writing_samples")
samples2 <- read_samples_df(
  example_writing_samples[, c("ID", "text")],
  id_col = "ID",
  text_col = "text"
)
head(samples2)
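As a rough illustration of the behaviour described for this function (extract the ID and text columns, enforce unique IDs, keep the remaining columns), a base R sketch might look like the following. The helper `extract_samples` is hypothetical and returns a plain data frame rather than a tibble; the error wording is an assumption.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
extract_samples <- function(df, id_col = 1, text_col = 2) {
  ids <- as.character(df[[id_col]])
  if (anyDuplicated(ids) > 0) {
    dup <- unique(ids[duplicated(ids)])
    stop("Duplicate IDs found: ", paste(dup, collapse = ", "))
  }
  texts <- as.character(df[[text_col]])
  # Resolve a column name or index to a position so other columns can be kept.
  pos <- function(col) {
    if (is.character(col)) match(col, names(df)) else as.integer(col)
  }
  rest <- df[, -c(pos(id_col), pos(text_col)), drop = FALSE]
  cbind(data.frame(ID = ids, text = texts, stringsAsFactors = FALSE), rest)
}

grades <- data.frame(
  StudentID = c("S1", "S2"),
  Response = c("This is sample 1.", "This is sample 2."),
  Grade = c(8, 9),
  stringsAsFactors = FALSE
)
out <- extract_samples(grades, id_col = "StudentID", text_col = "Response")
```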
This function reads all text files in a directory and uses the filename (without extension) as the sample ID and the file contents as the text.
read_samples_dir(path = ".", pattern = "\\.txt$")
path |
Directory containing .txt files. |
pattern |
A regular expression used to match file names.
Defaults to |
A tibble with columns:
ID: filename without extension
text: file contents as a single character string
# Create a temporary directory with sample text files
samples_dir <- tempfile()
dir.create(samples_dir)
writeLines("This is sample A.", file.path(samples_dir, "A.txt"))
writeLines("This is sample B.", file.path(samples_dir, "B.txt"))
# Read samples into a tibble
samples <- read_samples_dir(samples_dir)
samples
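The mapping from files to rows can be sketched with base R file utilities. The helper `read_txt_dir` is hypothetical, and joining multi-line files with a newline is an assumption about how "a single character string" is formed.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
read_txt_dir <- function(path = ".", pattern = "\\.txt$") {
  files <- list.files(path, pattern = pattern, full.names = TRUE)
  data.frame(
    # ID: file name without directory or extension
    ID = tools::file_path_sans_ext(basename(files)),
    # text: all lines of the file joined into one string (assumed separator: "\n")
    text = vapply(
      files,
      function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

txt_dir <- tempfile("samples-")
dir.create(txt_dir)
writeLines(c("First line.", "Second line."), file.path(txt_dir, "A.txt"))
writeLines("This is sample B.", file.path(txt_dir, "B.txt"))
writeLines("Not a sample.", file.path(txt_dir, "notes.md"))
txt_samples <- read_txt_dir(txt_dir)
```

Files that do not match `pattern` (here `notes.md`) are skipped.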
This function validates a template (or reads it from a file) and stores it under a user-provided name for reuse in the current R session. Registered templates live in a package-internal registry.
register_prompt_template(name, template = NULL, file = NULL, overwrite = FALSE)
name |
Character scalar; name under which to store the template. |
template |
Optional character string containing a custom template.
If |
file |
Optional path to a text file containing a template.
Ignored if |
overwrite |
Logical; if |
To make templates persistent across sessions, call this function
in your .Rprofile or in a project startup script.
Any template must contain the placeholders
{TRAIT_NAME}, {TRAIT_DESCRIPTION},
{SAMPLE_1}, and {SAMPLE_2}.
Invisibly, the validated template string.
# Register a custom template for this session
custom <- "
You are an expert writing assessor for {TRAIT_NAME}.
{TRAIT_NAME} is defined as {TRAIT_DESCRIPTION}.
Which of the samples below is better on {TRAIT_NAME}?
SAMPLE 1:
{SAMPLE_1}
SAMPLE 2:
{SAMPLE_2}
<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
"
register_prompt_template("my_custom", template = custom)
# Retrieve and inspect it
tmpl <- get_prompt_template("my_custom")
cat(substr(tmpl, 1, 160), "...\n")
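The placeholder requirement and the session-level registry can be sketched as follows. Everything here is hypothetical (`missing_placeholders`, `register_template_sketch`, and the `registry` environment); the package's internal registry and error messages may differ.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
required_placeholders <- c(
  "{TRAIT_NAME}", "{TRAIT_DESCRIPTION}", "{SAMPLE_1}", "{SAMPLE_2}"
)

# Return the required placeholders that do not occur in the template.
missing_placeholders <- function(template) {
  present <- vapply(
    required_placeholders,
    function(p) grepl(p, template, fixed = TRUE),
    logical(1)
  )
  required_placeholders[!present]
}

# Hypothetical in-memory registry that lives only for the current session.
registry <- new.env(parent = emptyenv())

register_template_sketch <- function(name, template, overwrite = FALSE) {
  absent <- missing_placeholders(template)
  if (length(absent) > 0) {
    stop("Template is missing: ", paste(absent, collapse = ", "))
  }
  if (exists(name, envir = registry, inherits = FALSE) && !overwrite) {
    stop("Template '", name, "' already exists; set overwrite = TRUE.")
  }
  assign(name, template, envir = registry)
  invisible(template)
}

ok_template <- "Judge {TRAIT_NAME} ({TRAIT_DESCRIPTION}): {SAMPLE_1} vs {SAMPLE_2}"
register_template_sketch("demo", ok_template)
```

Because the registry is an ordinary environment, it is empty again in a fresh R session, which is why registration would need to be repeated from a startup script.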
This function removes a template from the user registry created by
register_prompt_template. It does not affect built-in
templates stored under inst/templates.
remove_prompt_template(name, quiet = FALSE)
name |
Character scalar; name of the template to remove. |
quiet |
Logical; if |
Invisibly, TRUE if a template was removed,
FALSE otherwise.
register_prompt_template,
get_prompt_template,
list_prompt_templates
# Register and then remove a template
register_prompt_template("to_delete", template = set_prompt_template())
remove_prompt_template("to_delete")
This high-level helper mirrors run_openai_batch_pipeline but
targets Anthropic's Message Batches API. It:
run_anthropic_batch_pipeline(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  reasoning = c("none", "enabled"),
  include_thoughts = FALSE,
  batch_input_path = NULL,
  batch_output_path = NULL,
  poll = TRUE,
  interval_seconds = 60,
  timeout_seconds = 86400,
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  anthropic_version = "2023-06-01",
  verbose = TRUE,
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
Anthropic model name (for example |
trait_name |
Trait name to pass to
|
trait_description |
Trait description to pass to
|
prompt_template |
Prompt template string, typically from
|
reasoning |
Character scalar; one of |
include_thoughts |
Logical; if |
batch_input_path |
Path to write the JSON file containing the
|
batch_output_path |
Path to write the downloaded |
poll |
Logical; if |
interval_seconds |
Polling interval in seconds (used when
|
timeout_seconds |
Maximum total time in seconds for polling before
giving up (used when |
api_key |
Optional Anthropic API key. Defaults to
|
anthropic_version |
Anthropic API version string passed as the
|
verbose |
Logical; if |
... |
Additional Anthropic parameters forwarded to
|
Builds Anthropic batch requests from a tibble of pairs using
build_anthropic_batch_requests.
Writes a JSON file containing the requests object for
reproducibility.
Creates a Message Batch via anthropic_create_batch.
Optionally polls until the batch reaches processing_status =
"ended" using anthropic_poll_batch_until_complete.
If polling is enabled, downloads the .jsonl result file with
anthropic_download_batch_results and parses it via
parse_anthropic_batch_output.
It is the Anthropic analogue of run_openai_batch_pipeline and
returns a list with the same overall structure so that downstream code can
treat the two backends uniformly.
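The optional polling step, governed by `interval_seconds` and `timeout_seconds`, follows a common pattern that can be sketched generically. The helper `poll_until` and the fake status source below are hypothetical; how the package itself measures elapsed time or reports a timeout is not stated here and is assumed.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
poll_until <- function(get_status, done = "ended",
                       interval_seconds = 60, timeout_seconds = 86400) {
  started <- Sys.time()
  repeat {
    status <- get_status()
    if (identical(status, done)) return(status)
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    # Assumption: give up once the next wait would exceed the time budget.
    if (elapsed + interval_seconds > timeout_seconds) {
      stop("Timed out waiting for status '", done, "'.")
    }
    Sys.sleep(interval_seconds)
  }
}

# A stand-in for a remote batch that finishes on the third status check.
statuses <- c("in_progress", "in_progress", "ended")
checks <- 0
fake_status <- function() {
  checks <<- checks + 1
  statuses[checks]
}
final_status <- poll_until(fake_status, interval_seconds = 0.01, timeout_seconds = 5)
```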
When include_thoughts = TRUE and reasoning is left at its
default of "none", this function automatically upgrades
reasoning to "enabled" so that Claude's extended thinking
blocks are returned and parsed into the thoughts column by
parse_anthropic_batch_output.
Temperature and reasoning defaults
Temperature and thinking-mode behaviour are controlled by
build_anthropic_batch_requests:
When reasoning = "none" (no extended thinking):
The default temperature is 0 (deterministic),
unless you explicitly supply a temperature argument via
....
The default max_tokens is 768, unless you
override it via max_tokens in ....
When reasoning = "enabled" (extended thinking enabled):
temperature must be 1. If you supply a
different value in ...,
build_anthropic_batch_requests() will throw an error.
By default, max_tokens = 2048 and
thinking_budget_tokens = 1024, subject to the constraint
1024 <= thinking_budget_tokens < max_tokens. Violations of
this constraint also produce an error.
Therefore, when you run batches without extended thinking (the usual case),
the effective default is a temperature of 0. When you explicitly use
extended thinking (either by setting reasoning = "enabled" or by
using include_thoughts = TRUE), Anthropic's requirement of
temperature = 1 is enforced.
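The rules above can be collected into one small resolver. This is a sketch under the stated defaults only; the function name `resolve_anthropic_settings` and the error messages are hypothetical, and the real checks live in `build_anthropic_batch_requests`.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
resolve_anthropic_settings <- function(reasoning = "none",
                                       include_thoughts = FALSE,
                                       temperature = NULL,
                                       max_tokens = NULL,
                                       thinking_budget_tokens = NULL) {
  # include_thoughts = TRUE upgrades the default reasoning mode.
  if (isTRUE(include_thoughts) && reasoning == "none") reasoning <- "enabled"

  if (reasoning == "none") {
    return(list(
      reasoning = reasoning,
      temperature = if (is.null(temperature)) 0 else temperature,
      max_tokens = if (is.null(max_tokens)) 768 else max_tokens
    ))
  }

  # Extended thinking: temperature must be 1.
  if (!is.null(temperature) && temperature != 1) {
    stop("temperature must be 1 when reasoning = 'enabled'.")
  }
  max_tokens <- if (is.null(max_tokens)) 2048 else max_tokens
  budget <- if (is.null(thinking_budget_tokens)) 1024 else thinking_budget_tokens
  # Constraint: 1024 <= thinking_budget_tokens < max_tokens.
  if (budget < 1024 || budget >= max_tokens) {
    stop("thinking_budget_tokens must satisfy 1024 <= budget < max_tokens.")
  }
  list(
    reasoning = reasoning,
    temperature = 1,
    max_tokens = max_tokens,
    thinking_budget_tokens = budget
  )
}

plain <- resolve_anthropic_settings()
thinking <- resolve_anthropic_settings(include_thoughts = TRUE)
```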
A list with elements (aligned with
run_openai_batch_pipeline):
Path to the JSON file containing the batch
requests object.
Path to the downloaded .jsonl results file
if poll = TRUE, otherwise NULL.
Always NULL for Anthropic batches (OpenAI uses a File
object here). Included for structural compatibility.
Message Batch object; if poll = TRUE, this is the final
batch after polling, otherwise the initial batch returned by
anthropic_create_batch.
Parsed tibble from
parse_anthropic_batch_output if poll = TRUE,
otherwise NULL.
## Not run:
# Requires ANTHROPIC_API_KEY and network access.
library(pairwiseLLM)
data("example_writing_samples", package = "pairwiseLLM")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 5, seed = 123) |>
  randomize_pair_order(seed = 456)
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()
# Standard batch without extended thinking
pipeline_none <- run_anthropic_batch_pipeline(
  pairs = pairs,
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  reasoning = "none",
  include_thoughts = FALSE,
  interval_seconds = 60,
  timeout_seconds = 3600,
  verbose = TRUE
)
pipeline_none$batch$processing_status
head(pipeline_none$results)
# Batch with extended thinking and thoughts column
pipeline_thoughts <- run_anthropic_batch_pipeline(
  pairs = pairs,
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  include_thoughts = TRUE,
  interval_seconds = 60,
  timeout_seconds = 3600,
  verbose = TRUE
)
pipeline_thoughts$batch$processing_status
head(pipeline_thoughts$results)
## End(Not run)
This helper ties together the core batch operations:
Build batch requests from a tibble of pairs.
Create a Batch job via gemini_create_batch.
Optionally poll for completion and download results.
Parse the JSONL results into a tibble via
parse_gemini_batch_output.
run_gemini_batch_pipeline(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  thinking_level = "low",
  batch_input_path = tempfile(pattern = "gemini-batch-input-", fileext = ".json"),
  batch_output_path = tempfile(pattern = "gemini-batch-output-", fileext = ".jsonl"),
  poll = TRUE,
  interval_seconds = 60,
  timeout_seconds = 86400,
  api_key = Sys.getenv("GEMINI_API_KEY"),
  api_version = "v1beta",
  verbose = TRUE,
  include_thoughts = FALSE,
  ...
)
pairs |
Tibble/data frame of pairs. |
model |
Gemini model name, for example |
trait_name |
Trait name. |
trait_description |
Trait description. |
prompt_template |
Prompt template string. |
thinking_level |
One of This controls the maximum depth of internal reasoning for Gemini batch
requests via
|
batch_input_path |
Path where the batch input JSON should be written. |
batch_output_path |
Path where the batch output JSONL should be written
(only used if |
poll |
Logical; if |
interval_seconds |
Polling interval when |
timeout_seconds |
Maximum total waiting time when |
api_key |
Optional Gemini API key. |
api_version |
API version string. |
verbose |
Logical; if |
include_thoughts |
Logical; if |
... |
Additional arguments forwarded to
|
The returned list mirrors the structure of
run_openai_batch_pipeline and
run_anthropic_batch_pipeline.
A list with elements:
Path to the written batch input JSON.
Path to the batch output JSONL (or NULL
when poll = FALSE).
Reserved for parity with OpenAI/Anthropic; always NULL
for Gemini inline batches.
The created Batch job object.
Parsed tibble of results (or NULL when
poll = FALSE).
# This example requires:
# - A valid Gemini API key (set in GEMINI_API_KEY)
# - Internet access
# - Billable Gemini API usage
## Not run:
# Example pairwise data
data("example_writing_samples", package = "pairwiseLLM")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 5, seed = 123)
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()
# Run the full Gemini batch pipeline (Gemini 3 Pro example)
res <- run_gemini_batch_pipeline(
  pairs = pairs,
  model = "gemini-3-pro-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  thinking_level = "low",
  poll = TRUE,
  include_thoughts = FALSE
)
# Parsed pairwise comparison results
res$results
# Inspect batch metadata
res$batch
# Paths to saved input/output files
res$batch_input_path
res$batch_output_path
# Gemini 3 Flash example (minimal thinking)
res_flash <- run_gemini_batch_pipeline(
  pairs = pairs,
  model = "gemini-3-flash-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  thinking_level = "minimal",
  poll = TRUE,
  include_thoughts = FALSE
)
res_flash$results
## End(Not run)
This helper wires together the existing pieces:
optionally openai_poll_batch_until_complete()
optionally openai_download_batch_output()
optionally parse_openai_batch_output()
run_openai_batch_pipeline(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  include_thoughts = FALSE,
  include_raw = FALSE,
  endpoint = NULL,
  batch_input_path = tempfile("openai_batch_input_", fileext = ".jsonl"),
  batch_output_path = tempfile("openai_batch_output_", fileext = ".jsonl"),
  poll = TRUE,
  interval_seconds = 5,
  timeout_seconds = 600,
  max_attempts = Inf,
  metadata = NULL,
  api_key = NULL,
  ...
)
pairs |
Tibble of pairs with at least |
model |
OpenAI model name (e.g. |
trait_name |
Trait name to pass to |
trait_description |
Trait description to pass to
|
prompt_template |
Prompt template string, typically from
|
include_thoughts |
Logical; if |
include_raw |
Logical; if |
endpoint |
One of |
batch_input_path |
Path to write the batch input |
batch_output_path |
Path to write the batch output |
poll |
Logical; if |
interval_seconds |
Polling interval in seconds
(used when |
timeout_seconds |
Maximum total time in seconds for polling before
giving up (used when |
max_attempts |
Maximum number of polling attempts (primarily useful for testing). |
metadata |
Optional named list of metadata key–value pairs to pass to
|
api_key |
Optional OpenAI API key. Defaults to
|
... |
Additional arguments passed through to
|
It is a convenience wrapper around these smaller functions and is intended for end-to-end batch runs on a set of pairwise comparisons. For more control (or testing), you can call the components directly.
When endpoint is not specified, it is chosen automatically:
if include_thoughts = TRUE or GPT-5 reasoning is requested,
the "responses" endpoint is used and a default reasoning
effort of "low" is applied for GPT-5 series models unless
overridden via reasoning.
otherwise, "chat.completions" is used.
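One way to read the endpoint selection rule is sketched below. This is an interpretation, not the package's code: the helper `choose_endpoint` is hypothetical, detecting GPT-5 series models by a `gpt-5` name prefix is an assumption, and "GPT-5 reasoning is requested" is taken to mean that a `reasoning` value was supplied for such a model.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
choose_endpoint <- function(model, include_thoughts = FALSE,
                            reasoning = NULL, endpoint = NULL) {
  # An explicitly supplied endpoint is used as given.
  if (!is.null(endpoint)) {
    return(list(endpoint = endpoint, reasoning = reasoning))
  }
  is_gpt5 <- grepl("^gpt-5", model)  # assumption: prefix match on the model name
  wants_reasoning <- is_gpt5 && !is.null(reasoning)
  if (isTRUE(include_thoughts) || wants_reasoning) {
    # Default reasoning effort for GPT-5 series models unless overridden.
    if (is_gpt5 && is.null(reasoning)) reasoning <- "low"
    return(list(endpoint = "responses", reasoning = reasoning))
  }
  list(endpoint = "chat.completions", reasoning = reasoning)
}

choice_chat <- choose_endpoint("gpt-4.1")
choice_thoughts <- choose_endpoint("gpt-5", include_thoughts = TRUE)
```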
A list with elements:
batch_input_path – path to the input .jsonl file.
batch_output_path – path to the output .jsonl file (or NULL if
poll = FALSE).
file – File object returned by openai_upload_batch_file().
batch – Batch object; if poll = TRUE, this is the final
batch after polling, otherwise the initial batch returned by
openai_create_batch().
results – Parsed tibble from parse_openai_batch_output() if
poll = TRUE, otherwise NULL.
# The OpenAI batch pipeline requires:
# - Internet access
# - A valid OpenAI API key in OPENAI_API_KEY (or supplied via `api_key`)
# - Billable API usage
#
## Not run:
data("example_writing_samples", package = "pairwiseLLM")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 2, seed = 123) |>
  randomize_pair_order(seed = 456)
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()
# Run a small batch using chat.completions
out <- run_openai_batch_pipeline(
  pairs = pairs,
  model = "gpt-4.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  endpoint = "chat.completions",
  poll = TRUE,
  interval_seconds = 5,
  timeout_seconds = 600
)
print(out$batch$status)
print(utils::head(out$results))
## End(Not run)
This function samples a subset of rows from a pairs data frame
returned by make_pairs. You can specify either the
proportion of pairs to retain (pair_pct), the absolute number
of pairs (n_pairs), or both (in which case the minimum of the
two is used).
sample_pairs(pairs, pair_pct = 1, n_pairs = NULL, seed = NULL)
pairs |
A tibble with columns |
pair_pct |
Proportion of pairs to sample (between 0 and 1). Defaults to 1 (all pairs). |
n_pairs |
Optional integer specifying the maximum number of pairs to sample. |
seed |
Optional integer seed for reproducible sampling. |
A tibble containing the sampled rows of pairs.
samples <- tibble::tibble(
  ID = c("S1", "S2", "S3", "S4"),
  text = paste("Sample", 1:4)
)
pairs_all <- make_pairs(samples)
# Sample 50% of all pairs
sample_pairs(pairs_all, pair_pct = 0.5, seed = 123)
# Sample exactly 3 pairs
sample_pairs(pairs_all, n_pairs = 3, seed = 123)
# Using built-in examples and sample 10% of all pairs
data("example_writing_samples")
pairs_ex <- make_pairs(example_writing_samples)
pairs_ex_sample <- sample_pairs(pairs_ex, pair_pct = 0.10, seed = 1)
nrow(pairs_ex_sample)
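How `pair_pct` and `n_pairs` combine (the smaller of the two targets wins) can be sketched as follows. The helper `sample_rows` is hypothetical, and rounding `pair_pct * nrow(pairs)` to the nearest integer is an assumption; the package may round differently.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
sample_rows <- function(pairs, pair_pct = 1, n_pairs = NULL, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  total <- nrow(pairs)
  n_target <- round(pair_pct * total)  # assumption: nearest-integer rounding
  # When both are given, use the minimum of the two targets.
  if (!is.null(n_pairs)) n_target <- min(n_target, n_pairs)
  n_target <- max(0, min(n_target, total))
  pairs[sample.int(total, n_target), , drop = FALSE]
}

six_pairs <- data.frame(
  ID1 = c("S1", "S1", "S1", "S2", "S2", "S3"),
  ID2 = c("S2", "S3", "S4", "S3", "S4", "S4"),
  stringsAsFactors = FALSE
)
half <- sample_rows(six_pairs, pair_pct = 0.5, seed = 123)
capped <- sample_rows(six_pairs, pair_pct = 0.5, n_pairs = 2, seed = 123)
```

With six pairs, `pair_pct = 0.5` alone targets three rows, while adding `n_pairs = 2` lowers the target to two.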
Given a table of pairs with columns ID1, text1,
ID2, and text2, this function selects a subset
of rows and returns a new tibble where the order of each selected
pair is reversed.
sample_reverse_pairs(pairs, reverse_pct = NULL, n_reverse = NULL, seed = NULL)
pairs |
A data frame or tibble with columns |
reverse_pct |
Optional proportion of rows to reverse
(between 0 and 1). If |
n_reverse |
Optional absolute number of rows to reverse.
If supplied, this takes precedence over |
seed |
Optional integer seed for reproducible sampling. |
A tibble containing the reversed pairs only (i.e., with
ID1 swapped with ID2 and text1 swapped with
text2).
data("example_writing_samples")
pairs <- make_pairs(example_writing_samples)
# Reverse 20% of the pairs
rev20 <- sample_reverse_pairs(pairs, reverse_pct = 0.2, seed = 123)
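A base R sketch of selecting a subset and returning only the reversed rows is given below. The helper `reverse_subset` is hypothetical; the rounding rule and the error raised when neither `reverse_pct` nor `n_reverse` is supplied are assumptions.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
reverse_subset <- function(pairs, reverse_pct = NULL, n_reverse = NULL,
                           seed = NULL) {
  if (is.null(reverse_pct) && is.null(n_reverse)) {
    # Assumption: this sketch requires one of the two arguments.
    stop("Supply reverse_pct or n_reverse.")
  }
  if (!is.null(seed)) set.seed(seed)
  total <- nrow(pairs)
  # n_reverse takes precedence when it is supplied.
  n <- if (!is.null(n_reverse)) n_reverse else round(reverse_pct * total)
  n <- max(0, min(n, total))
  picked <- pairs[sample.int(total, n), , drop = FALSE]
  # Swap positions: ID1 <-> ID2 and text1 <-> text2.
  data.frame(
    ID1 = picked$ID2, text1 = picked$text2,
    ID2 = picked$ID1, text2 = picked$text1,
    stringsAsFactors = FALSE
  )
}

four_pairs <- data.frame(
  ID1 = c("S1", "S1", "S2", "S3"), text1 = c("a", "a", "b", "c"),
  ID2 = c("S2", "S3", "S3", "S4"), text2 = c("b", "c", "c", "d"),
  stringsAsFactors = FALSE
)
rev_half <- reverse_subset(four_pairs, reverse_pct = 0.5, seed = 123)
```

Submitting the returned rows alongside the forward pairs gives each selected comparison in both orders, which is what a reverse-order consistency check needs.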
Save an adaptive session to disk.
save_adaptive_session(state, session_dir, overwrite = FALSE)
state |
Adaptive state. |
session_dir |
Directory to write session artifacts. |
overwrite |
Logical; overwrite existing artifacts. |
Saves canonical Adaptive artifacts under session_dir:
state.rds, step_log.rds, round_log.rds,
metadata.rds, optional btl_fit.rds, and optional per-refit item
log files when state$config$persist_item_log is TRUE. Writes
are atomic at file level to reduce partial-write risk. Persisted
step_log/round_log files keep the full canonical schemas, so
resume preserves expanded audit fields without recomputation.
The session_dir path, invisibly.
validate_session_dir(), load_adaptive_session()
Other adaptive persistence:
load_adaptive_session(),
validate_session_dir()
dir <- tempfile("pwllm-session-")
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
save_adaptive_session(state, dir, overwrite = TRUE)
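File-level atomic writes are commonly done by writing to a temporary file in the target directory and then renaming it into place. The sketch below shows that general pattern; the helper `atomic_save_rds` is hypothetical and is not claimed to match how the package writes its artifacts. Replacement of an existing file by `file.rename()` is platform dependent, so the comment about atomicity applies to POSIX-style file systems.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
atomic_save_rds <- function(object, path) {
  # Write to a temporary file in the same directory, so the final rename
  # stays on one file system (a single step on POSIX-style systems).
  tmp <- tempfile(pattern = paste0(basename(path), "-"), tmpdir = dirname(path))
  saveRDS(object, tmp)
  if (!file.rename(tmp, path)) {
    unlink(tmp)
    stop("Could not move temporary file into place: ", path)
  }
  invisible(path)
}

sketch_dir <- tempfile("session-sketch-")
dir.create(sketch_dir)
target <- file.path(sketch_dir, "state.rds")
atomic_save_rds(list(step = 1L), target)
atomic_save_rds(list(step = 2L), target)  # replaces the earlier file
restored <- readRDS(target)
```

A reader of `state.rds` then sees either the old file or the new one, never a half-written file.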
This function returns a default prompt template that includes
placeholders for the trait name, trait description, and two
writing samples. Any custom template must contain the
placeholders {TRAIT_NAME}, {TRAIT_DESCRIPTION},
{SAMPLE_1}, and {SAMPLE_2}.
set_prompt_template(template = NULL, file = NULL)
template |
Optional character string containing a custom template.
If |
file |
Optional path to a text file containing a template.
Ignored if |
The default template is stored as a plain-text file in
inst/templates/default.txt and loaded at run time. This
makes it easy to inspect and modify the prompt text without
changing the R code.
A character string containing the prompt template.
# Get the default template shipped with the package
tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 200), "...\n")
# Use a custom template defined in-line
custom <- "
You are an expert writing assessor for {TRAIT_NAME}.
{TRAIT_NAME} is defined as {TRAIT_DESCRIPTION}.
Which of the samples below is better on {TRAIT_NAME}?
SAMPLE 1:
{SAMPLE_1}
SAMPLE 2:
{SAMPLE_2}
<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
"
tmpl2 <- set_prompt_template(template = custom)
cat(substr(tmpl2, 1, 120), "...\n")
This is a robust row-wise wrapper around
anthropic_compare_pair_live. It takes a tibble of pairs
(ID1 / text1 / ID2 / text2), submits each pair to the Anthropic
Messages API, and collects the results.
submit_anthropic_pairs_live(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  api_key = NULL,
  anthropic_version = "2023-06-01",
  reasoning = c("none", "enabled"),
  verbose = TRUE,
  status_every = 1,
  progress = TRUE,
  include_raw = FALSE,
  include_thoughts = NULL,
  save_path = NULL,
  parallel = FALSE,
  workers = 1,
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
Anthropic model name (for example |
trait_name |
Trait name to pass to |
trait_description |
Trait description to pass to
|
prompt_template |
Prompt template string, typically from
|
api_key |
Optional Anthropic API key. Defaults to
|
anthropic_version |
Anthropic API version string passed as the
|
reasoning |
Character scalar passed to
|
verbose |
Logical; if |
status_every |
Integer; print status / timing for every
|
progress |
Logical; if |
include_raw |
Logical; if |
include_thoughts |
Logical or |
save_path |
Character string; optional file path (e.g., "output.csv")
to save results incrementally. If the file exists, the function reads it
to identify and skip pairs that have already been processed (resume mode).
Requires the |
parallel |
Logical; if |
workers |
Integer; the number of parallel workers (threads) to use if
|
... |
Additional Anthropic parameters (for example |
This function offers:
Parallel Processing: Uses the future package to process
multiple pairs simultaneously.
Incremental Saving: Writes results to a CSV file as they complete.
If the process is interrupted, re-running the function with the same
save_path will automatically skip pairs that were already successfully processed.
Error Separation: Returns valid results and failed pairs separately, making it easier to debug or retry specific failures.
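The resume behaviour of incremental saving can be pictured as a filter applied before any API calls are made. The sketch below is an assumption-laden illustration: the helper `pending_pairs` is hypothetical, and matching completed work on the `ID1`/`ID2` columns of the saved CSV is assumed rather than documented here.

```r
# Illustrative sketch only; not the pairwiseLLM implementation.
pending_pairs <- function(pairs, save_path = NULL) {
  # Nothing saved yet: every pair is still pending.
  if (is.null(save_path) || !file.exists(save_path)) return(pairs)
  done <- utils::read.csv(save_path, stringsAsFactors = FALSE)
  # Assumption: a pair counts as processed when its (ID1, ID2) is in the file.
  done_keys <- paste(done$ID1, done$ID2, sep = "||")
  keys <- paste(pairs$ID1, pairs$ID2, sep = "||")
  pairs[!keys %in% done_keys, , drop = FALSE]
}

todo <- data.frame(
  ID1 = c("S01", "S03", "S05"),
  ID2 = c("S02", "S04", "S06"),
  stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
# Simulate an interrupted run in which only the first pair was saved.
utils::write.csv(
  data.frame(ID1 = "S01", ID2 = "S02", better_id = "S02"),
  csv_path,
  row.names = FALSE
)
remaining <- pending_pairs(todo, csv_path)
```

The `better_id` column in the simulated file is a placeholder name for whatever result columns a saved run contains.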
Temperature and reasoning behaviour
Temperature and extended-thinking behaviour are controlled by
anthropic_compare_pair_live:
When reasoning = "none" (no extended thinking), the default
temperature is 0 (deterministic) unless you explicitly
supply a different temperature via ....
When reasoning = "enabled" (extended thinking), Anthropic
requires temperature = 1. If you supply a different value, an
error is raised by anthropic_compare_pair_live.
If you set include_thoughts = TRUE while reasoning = "none",
the underlying calls upgrade to reasoning = "enabled", which in turn
implies temperature = 1 and adds a thinking block to the API
request. When include_thoughts = FALSE (the default), and you leave
reasoning = "none", the effective default temperature is 0.
A list containing three elements:
A tibble with one row per successfully processed pair.
A tibble containing the rows from pairs that
failed to process (due to API errors or timeouts), along with an
error_message column.
A tibble of attempt-level failures (retries, timeouts, parse errors, invalid winners), separate from observed outcomes.
## Not run:
# Requires ANTHROPIC_API_KEY and network access.
data("example_writing_samples", package = "pairwiseLLM")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 5, seed = 123) |>
  randomize_pair_order(seed = 456)
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()
# 1. Sequential execution with incremental saving
res_claude <- submit_anthropic_pairs_live(
  pairs = pairs,
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  reasoning = "none",
  save_path = "results_seq.csv"
)
# 2. Parallel execution (faster)
res_par <- submit_anthropic_pairs_live(
  pairs = pairs,
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  save_path = "results_par.csv",
  parallel = TRUE,
  workers = 4
)
# Inspect results
head(res_par$results)
# Check for failures
if (nrow(res_par$failed_pairs) > 0) {
  message("Some pairs failed:")
  print(res_par$failed_pairs)
}
## End(Not run)
This is a robust row-wise wrapper around gemini_compare_pair_live(). It
takes a tibble of pairs (ID1 / text1 / ID2 / text2), submits each
pair to the Google Gemini API, and collects the results.
submit_gemini_pairs_live(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  api_key = NULL,
  thinking_level = "low",
  temperature = NULL,
  top_p = NULL,
  top_k = NULL,
  max_output_tokens = NULL,
  api_version = "v1beta",
  verbose = TRUE,
  status_every = 1L,
  progress = TRUE,
  include_raw = FALSE,
  include_thoughts = FALSE,
  save_path = NULL,
  parallel = FALSE,
  workers = 1,
  ...
)
pairs |
Tibble/data frame with columns |
model |
Gemini model name (e.g. |
trait_name |
Trait name. |
trait_description |
Trait description. |
prompt_template |
Prompt template string, typically from
|
api_key |
Optional Gemini API key. |
thinking_level |
Default |
temperature |
Optional numeric temperature; forwarded to
|
top_p |
Optional numeric; forwarded to |
top_k |
Optional numeric; forwarded to |
max_output_tokens |
Optional integer; forwarded to
|
api_version |
API version; default |
verbose |
Logical; print status/timing every |
status_every |
Integer; how often to print status (default 1 = every pair). |
progress |
Logical; show a text progress bar. |
include_raw |
Logical; if |
include_thoughts |
Logical; if |
save_path |
Character string; optional file path (e.g., "output.csv")
to save results incrementally. If the file exists, the function reads it
to identify and skip pairs that have already been processed (resume mode).
Requires the |
parallel |
Logical; if |
workers |
Integer; the number of parallel workers (threads) to use if
|
... |
Reserved for future extensions; passed through to
|
This function offers:
Parallel Processing: Uses the future package to process
multiple pairs simultaneously.
Incremental Saving: Writes results to a CSV file as they complete.
If the process is interrupted, re-running the function with the same
save_path will automatically skip pairs that were already successfully processed.
Error Separation: Returns valid results and failed pairs separately, making it easier to debug or retry specific failures.
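The resume behavior described above can be pictured with a small base R sketch. This is not the package's implementation: the helper name `skip_processed_pairs()` is hypothetical, and it assumes the saved CSV carries `ID1` and `ID2` columns that identify each processed pair.

```r
# Hypothetical helper (not part of pairwiseLLM): drop pairs whose (ID1, ID2)
# key already appears in a previously saved results file.
# Assumption: the saved CSV contains ID1 and ID2 columns.
skip_processed_pairs <- function(pairs, save_path) {
  if (is.null(save_path) || !file.exists(save_path)) {
    return(pairs)
  }
  done <- utils::read.csv(save_path, stringsAsFactors = FALSE)
  done_key <- paste(done$ID1, done$ID2, sep = "||")
  pair_key <- paste(pairs$ID1, pairs$ID2, sep = "||")
  pairs[!(pair_key %in% done_key), , drop = FALSE]
}

pairs <- data.frame(
  ID1 = c("S01", "S03", "S05"),
  text1 = c("Text 1", "Text 3", "Text 5"),
  ID2 = c("S02", "S04", "S06"),
  text2 = c("Text 2", "Text 4", "Text 6"),
  stringsAsFactors = FALSE
)

# Simulate an interrupted run that saved only the first pair.
save_path <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(ID1 = "S01", ID2 = "S02", better_id = "S01"),
  save_path,
  row.names = FALSE
)

# Only the pairs not yet present in the file remain to be submitted.
remaining <- skip_processed_pairs(pairs, save_path)
```

Keying on both IDs in order means a reversed pair (the same two samples with positions swapped) would still be treated as unprocessed, which matters when forward and reversed sets are submitted to study positional bias.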
A list containing three elements:
A tibble with one row per successfully processed pair.
A tibble containing the rows from pairs that
failed to process (due to API errors or timeouts), along with an
error_message column.
A tibble of attempt-level failures (retries, timeouts, parse errors, invalid winners), separate from observed outcomes.
# Requires:
# - GEMINI_API_KEY set in your environment
# - Internet access
# - Billable Gemini API usage
## Not run: 
# Example pair data
pairs <- tibble::tibble(
  ID1 = c("S01", "S03"),
  text1 = c("Text 1", "Text 3"),
  ID2 = c("S02", "S04"),
  text2 = c("Text 2", "Text 4")
)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# 1. Sequential execution with incremental saving
res_seq <- submit_gemini_pairs_live(
  pairs = pairs,
  model = "gemini-3-pro-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  save_path = "results_gemini_seq.csv"
)

# 2. Parallel execution (faster)
res_par <- submit_gemini_pairs_live(
  pairs = pairs,
  model = "gemini-3-pro-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  save_path = "results_gemini_par.csv",
  parallel = TRUE,
  workers = 4
)

# 3. Gemini 3 Flash example (minimal thinking)
res_flash <- submit_gemini_pairs_live(
  pairs = pairs,
  model = "gemini-3-flash-preview",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  thinking_level = "minimal",
  save_path = "results_gemini_flash.csv"
)

# Inspect results
head(res_par$results)
## End(Not run)
submit_llm_pairs() is a backend-neutral wrapper around row-wise comparison
for multiple pairs. It takes a tibble of pairs (ID1, text1, ID2,
text2), submits each pair to the selected backend, and aggregates the results.
submit_llm_pairs(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  backend = c("openai", "anthropic", "gemini", "together", "ollama"),
  endpoint = c("chat.completions", "responses"),
  api_key = NULL,
  verbose = TRUE,
  status_every = 1,
  progress = TRUE,
  include_raw = FALSE,
  save_path = NULL,
  parallel = FALSE,
  workers = 1,
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
Model identifier for the chosen backend. For |
trait_name |
Trait name to pass through to the backend-specific
comparison function (for example |
trait_description |
Full-text trait description passed to the backend. |
prompt_template |
Prompt template string, typically from
|
backend |
Character scalar indicating which LLM provider to use.
One of |
endpoint |
Character scalar specifying which endpoint family to use for
backends that support multiple live APIs. For the |
api_key |
Optional API key for the selected backend. If |
verbose |
Logical; if |
status_every |
Integer; print status and timing for every
|
progress |
Logical; if |
include_raw |
Logical; if |
save_path |
Character string; optional file path (e.g., "output.csv") to save results incrementally. If the file exists, the function reads it to identify and skip pairs that have already been processed (resume mode). Supported by all backends. |
parallel |
Logical; if |
workers |
Integer; the number of parallel workers (threads) to use if
|
... |
Additional backend-specific parameters. For |
This function supports parallel processing, incremental saving, and resume
capability for the "openai", "anthropic", "gemini", "together",
and "ollama" backends.
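A backend-neutral wrapper of this kind typically validates the `backend` argument and then hands off to a provider-specific function. The sketch below illustrates that pattern in base R using the function names from the backend mapping documented for this function. The helper `resolve_backend_fun()` is hypothetical and is not claimed to match how `submit_llm_pairs()` is written internally.

```r
# Hypothetical dispatcher (illustration only): map a backend label to the
# name of the provider-specific submit function.
resolve_backend_fun <- function(backend = c("openai", "anthropic", "gemini",
                                            "together", "ollama")) {
  # match.arg() rejects unknown labels and, when the argument is left at its
  # default vector, selects the first choice.
  backend <- match.arg(backend)
  switch(backend,
    openai    = "submit_openai_pairs_live",
    anthropic = "submit_anthropic_pairs_live",
    gemini    = "submit_gemini_pairs_live",
    together  = "submit_together_pairs_live",
    ollama    = "submit_ollama_pairs_live"
  )
}

fun_gemini <- resolve_backend_fun("gemini")
fun_default <- resolve_backend_fun()

# An unsupported label raises an error instead of silently falling through.
bad <- try(resolve_backend_fun("unknown-backend"), silent = TRUE)
```

Returning the function name rather than calling it keeps the sketch runnable without the package or any API key.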
At present, the following backends are implemented:
"openai" → submit_openai_pairs_live()
"anthropic" → submit_anthropic_pairs_live()
"gemini" → submit_gemini_pairs_live()
"together" → submit_together_pairs_live()
"ollama" → submit_ollama_pairs_live()
A list containing:
A tibble with one row per successfully processed pair.
A tibble containing rows that failed to process (for supported backends).
A tibble containing normalized failure records (invalid winners, parse failures, HTTP/timeouts) suitable for debugging.
submit_openai_pairs_live(), submit_anthropic_pairs_live(),
submit_gemini_pairs_live(), submit_together_pairs_live(), and
submit_ollama_pairs_live() for backend-specific implementations.
## Not run: 
# Requires an API key for the chosen cloud backend.
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 5, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Parallel execution with OpenAI (requires future package)
res_live <- submit_llm_pairs(
  pairs = pairs,
  model = "gpt-4.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  backend = "openai",
  endpoint = "chat.completions",
  parallel = TRUE,
  workers = 4,
  save_path = "results_openai.csv"
)

# Live comparisons using a local Ollama backend with incremental saving
res_ollama <- submit_llm_pairs(
  pairs = pairs,
  model = "mistral-small3.2:24b",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  backend = "ollama",
  save_path = "results_ollama.csv",
  verbose = TRUE
)

# GPT-5 live comparisons with service tier
res_gpt5 <- submit_llm_pairs(
  pairs = pairs,
  model = "gpt-5",
  trait_name = td$name,
  trait_description = td$description,
  backend = "openai",
  endpoint = "responses",
  reasoning = "none",
  service_tier = "flex"
)

res_ollama$results
## End(Not run)
submit_ollama_pairs_live() is a robust row-wise wrapper around
ollama_compare_pair_live(). It takes a tibble of pairs (ID1 / text1 /
ID2 / text2), submits each pair to a local (or remote) Ollama server,
and collects the results.
submit_ollama_pairs_live(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  host = getOption("pairwiseLLM.ollama_host", "http://127.0.0.1:11434"),
  verbose = TRUE,
  status_every = 1,
  progress = TRUE,
  think = FALSE,
  num_ctx = 8192L,
  include_raw = FALSE,
  save_path = NULL,
  parallel = FALSE,
  workers = 1,
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
Ollama model name (for example |
trait_name |
Trait name to pass to |
trait_description |
Trait description to pass to
|
prompt_template |
Prompt template string, typically from
|
host |
Base URL of the Ollama server. Defaults to the option
|
verbose |
Logical; if |
status_every |
Integer; print status and timing for every
|
progress |
Logical; if |
think |
Logical; see |
num_ctx |
Integer; context window to use via |
include_raw |
Logical; if |
save_path |
Character string; optional file path (e.g., "output.csv")
to save results incrementally. If the file exists, the function reads it
to identify and skip pairs that have already been processed (resume mode).
Requires the |
parallel |
Logical; if |
workers |
Integer; the number of parallel workers (threads) to use if
|
... |
Reserved for future extensions and forwarded to
|
This function offers:
Incremental Saving: Writes results to a CSV file as they complete.
If the process is interrupted, re-running the function with the same
save_path will automatically skip pairs that were already successfully processed.
Parallel Processing: Uses the future package to process
multiple pairs simultaneously. Note: Since Ollama typically runs
locally on the GPU, parallel processing may degrade performance or cause
out-of-memory errors unless the hardware can handle concurrent requests.
Defaults are set to sequential processing.
Temperature and context length are controlled as follows:
By default, temperature = 0 for all models.
For Qwen models (model names beginning with "qwen") called with
think = TRUE, temperature is set to 0.6 instead.
The context window is set via options$num_ctx, which defaults to
8192 but may be overridden via the num_ctx argument.
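The rules above can be written out as a small decision function. This is a sketch of the documented behavior, not the package's code: `ollama_sampling_options()` is a hypothetical helper, the prefix match is case-sensitive because the documentation does not say otherwise, and the Qwen model tag is illustrative.

```r
# Hypothetical helper mirroring the documented defaults:
# - temperature 0 for all models,
# - temperature 0.6 when the model name begins with "qwen" and think = TRUE,
# - num_ctx defaults to 8192 unless overridden.
ollama_sampling_options <- function(model, think = FALSE, num_ctx = 8192L) {
  is_qwen <- startsWith(model, "qwen")
  temperature <- if (is_qwen && isTRUE(think)) 0.6 else 0
  list(temperature = temperature, num_ctx = as.integer(num_ctx))
}

opts_default <- ollama_sampling_options("mistral-small3.2:24b")
# "qwen3:32b" is only an illustrative model tag.
opts_qwen_think <- ollama_sampling_options("qwen3:32b", think = TRUE)
opts_qwen_plain <- ollama_sampling_options("qwen3:32b", think = FALSE)
opts_big_ctx <- ollama_sampling_options("mistral-small3.2:24b", num_ctx = 16384L)
```

A temperature of 0 keeps repeated comparisons of the same pair as consistent as the model allows, which is usually what a ranking workflow wants.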
In most user-facing workflows, it is more convenient to call
submit_llm_pairs() with backend = "ollama" rather than using
submit_ollama_pairs_live() directly.
As with ollama_compare_pair_live(), this function assumes that:
An Ollama server is running and reachable at host.
The requested models have been pulled in advance (for example
ollama pull mistral-small3.2:24b).
A list containing three elements:
A tibble with one row per successfully processed pair.
A tibble containing the rows from pairs that
failed to process (due to API errors or timeouts), along with an
error_message column.
A tibble of attempt-level failures (retries, timeouts, parse errors, invalid winners), separate from observed outcomes.
ollama_compare_pair_live() for single-pair Ollama comparisons.
submit_llm_pairs() for backend-agnostic comparisons over tibbles of
pairs.
## Not run: 
# Requires a running Ollama server and locally available models.
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 5, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Live comparisons with incremental saving
res_mistral <- submit_ollama_pairs_live(
  pairs = pairs,
  model = "mistral-small3.2:24b",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  save_path = "ollama_results.csv",
  verbose = TRUE
)

# Access results
res_mistral$results
## End(Not run)
This is a robust row-wise wrapper around
openai_compare_pair_live. It takes a tibble of pairs
(ID1 / text1 / ID2 / text2), submits each pair to the OpenAI API, and
collects the results.
submit_openai_pairs_live(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  endpoint = c("chat.completions", "responses"),
  api_key = NULL,
  verbose = TRUE,
  status_every = 1,
  progress = TRUE,
  include_raw = FALSE,
  save_path = NULL,
  parallel = FALSE,
  workers = 1,
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
OpenAI model name (for example "gpt-4.1", "gpt-5.1"). |
trait_name |
Trait name to pass to |
trait_description |
Trait description to pass to
|
prompt_template |
Prompt template string, typically from
|
endpoint |
Which OpenAI endpoint to target. One of
|
api_key |
Optional OpenAI API key. |
verbose |
Logical; if TRUE, prints status, timing, and result summaries. |
status_every |
Integer; print status / timing for every
|
progress |
Logical; if TRUE, shows a textual progress bar. |
include_raw |
Logical; if TRUE, each row of the returned tibble will
include a |
save_path |
Character string; optional file path (e.g., "output.csv")
to save results incrementally. If the file exists, the function reads it
to identify and skip pairs that have already been processed (resume mode).
Requires the |
parallel |
Logical; if TRUE, enables parallel processing using
|
workers |
Integer; the number of parallel workers (threads) to use if
|
... |
Additional OpenAI parameters (temperature, top_p, logprobs,
reasoning, service_tier, and so on) passed on to
|
This function improves upon simple looping by offering:
Parallel Processing: Uses the future package to process
multiple pairs simultaneously.
Incremental Saving: Writes results to a CSV file as they complete.
If the process is interrupted, re-running the function with the same
save_path will automatically skip pairs that were already successfully processed.
Error Separation: Returns valid results and failed pairs separately, making it easier to debug or retry specific failures.
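Error separation can be illustrated with a plain `tryCatch()` loop. The sketch below is an assumption about the general pattern, not the package's code: `run_pairs_safely()` and `mock_compare()` are hypothetical, and the mock stands in for a real single-pair API call so the example runs offline.

```r
# Hypothetical row-wise runner: successful rows go to `results`, rows that
# raise an error go to `failed_pairs` together with the error message.
run_pairs_safely <- function(pairs, compare_fun) {
  results <- list()
  failed <- list()
  for (i in seq_len(nrow(pairs))) {
    row <- pairs[i, , drop = FALSE]
    out <- tryCatch(
      compare_fun(row$ID1, row$text1, row$ID2, row$text2),
      error = function(e) e
    )
    if (inherits(out, "error")) {
      row$error_message <- conditionMessage(out)
      failed[[length(failed) + 1L]] <- row
    } else {
      results[[length(results) + 1L]] <- out
    }
  }
  list(
    results = do.call(rbind, results),
    failed_pairs = do.call(rbind, failed)
  )
}

# Mock comparison (no API call): errors on empty text, otherwise prefers the
# longer sample. Purely for demonstration.
mock_compare <- function(ID1, text1, ID2, text2) {
  if (!nzchar(text1) || !nzchar(text2)) stop("empty sample text")
  data.frame(
    ID1 = ID1, ID2 = ID2,
    better_id = if (nchar(text1) >= nchar(text2)) ID1 else ID2,
    stringsAsFactors = FALSE
  )
}

pairs <- data.frame(
  ID1 = c("S01", "S03"), text1 = c("A longer text", ""),
  ID2 = c("S02", "S04"), text2 = c("Short", "Text 4"),
  stringsAsFactors = FALSE
)

out <- run_pairs_safely(pairs, mock_compare)
```

Because one failing row never aborts the loop, a single timeout or malformed response costs only that pair rather than the whole run.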
A list containing three elements:
A tibble with one row per successfully processed pair and
columns such as better_id, better_sample, thoughts,
and content. See openai_compare_pair_live for
details.
A tibble containing the rows from pairs that
failed to process (due to API errors or timeouts), along with an
error_message column. These can be easily re-submitted.
A tibble of attempt-level failures (retries, timeouts, parse errors, invalid winners), separate from observed outcomes.
## Not run: 
# Requires API key set and internet access
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 10, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# 1. Sequential execution with incremental saving
# If interrupted, running this again will resume progress.
res_seq <- submit_openai_pairs_live(
  pairs = pairs,
  model = "gpt-4.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  save_path = "results_seq.csv"
)

# 2. Parallel execution (faster)
# Note: On Windows, this opens background R sessions.
res_par <- submit_openai_pairs_live(
  pairs = pairs,
  model = "gpt-4.1",
  trait_name = td$name,
  trait_description = td$description,
  save_path = "results_par.csv",
  parallel = TRUE,
  workers = 4
)

# Inspect results
head(res_par$results)

# Check for failures
if (nrow(res_par$failed_pairs) > 0) {
  message("Some pairs failed:")
  print(res_par$failed_pairs)
}

# 3. GPT-5 live run with service tier (Responses endpoint)
res_gpt5 <- submit_openai_pairs_live(
  pairs = pairs,
  model = "gpt-5",
  trait_name = td$name,
  trait_description = td$description,
  endpoint = "responses",
  reasoning = "none",
  service_tier = "priority"
)
## End(Not run)
submit_together_pairs_live() is a robust row-wise wrapper around
together_compare_pair_live(). It takes a tibble of pairs (ID1, text1,
ID2, text2), submits each pair to the Together.ai Chat Completions API,
and collects the results.
submit_together_pairs_live(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  api_key = NULL,
  verbose = TRUE,
  status_every = 1,
  progress = TRUE,
  include_raw = FALSE,
  save_path = NULL,
  parallel = FALSE,
  workers = 1,
  ...
)
pairs |
Tibble or data frame with at least columns |
model |
Together.ai model name, for example |
trait_name |
Trait name to pass to |
trait_description |
Trait description to pass to
|
prompt_template |
Prompt template string, typically from
|
api_key |
Optional Together.ai API key. If |
verbose |
Logical; if |
status_every |
Integer; print status / timing for every
|
progress |
Logical; if |
include_raw |
Logical; if |
save_path |
Character string; optional file path (e.g., "output.csv")
to save results incrementally. If the file exists, the function reads it
to identify and skip pairs that have already been processed (resume mode).
Requires the |
parallel |
Logical; if |
workers |
Integer; the number of parallel workers (threads) to use if
|
... |
Additional Together.ai parameters, such as |
This function improves upon simple looping by offering:
Parallel Processing: Uses the future package to process
multiple pairs simultaneously.
Incremental Saving: Writes results to a CSV file as they complete.
If the process is interrupted, re-running the function with the same
save_path will automatically skip pairs that were already successfully processed.
Error Separation: Returns valid results and failed pairs separately, making it easier to debug or retry specific failures.
A list containing three elements:
A tibble with one row per successfully processed pair and
columns such as better_id, better_sample, thoughts, and content.
A tibble containing the rows from pairs that failed
to process (due to API errors or timeouts), along with an
error_message column. These can be easily re-submitted.
A tibble of attempt-level failures (retries, timeouts, parse errors, invalid winners), separate from observed outcomes.
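Since the failed rows keep the original pair columns plus `error_message`, re-submitting them amounts to dropping that column and calling the submit function again. The following is a minimal sketch under that assumption. `retry_failed_pairs()` and `mock_submit()` are hypothetical, and the mock replaces a real `submit_*_pairs_live()` call (with model and trait arguments already fixed) so the example runs without network access.

```r
# Hypothetical retry loop: re-submit failed pairs up to max_retries times and
# append any new successes to the results.
retry_failed_pairs <- function(first_run, submit_fun, max_retries = 2L) {
  results <- first_run$results
  failed <- first_run$failed_pairs
  for (attempt in seq_len(max_retries)) {
    if (is.null(failed) || nrow(failed) == 0L) break
    # Drop the diagnostic column so the rows look like ordinary input pairs.
    to_retry <- failed[, setdiff(names(failed), "error_message"), drop = FALSE]
    rerun <- submit_fun(to_retry)
    results <- rbind(results, rerun$results)
    failed <- rerun$failed_pairs
  }
  list(results = results, failed_pairs = failed)
}

# Mock first run: one success and one failure (values are made up).
first_run <- list(
  results = data.frame(
    ID1 = "S01", ID2 = "S02", better_id = "S01",
    stringsAsFactors = FALSE
  ),
  failed_pairs = data.frame(
    ID1 = "S03", text1 = "Text 3", ID2 = "S04", text2 = "Text 4",
    error_message = "mock timeout",
    stringsAsFactors = FALSE
  )
)

# Mock submit function that always succeeds and reports no failures.
mock_submit <- function(pairs) {
  list(
    results = data.frame(
      ID1 = pairs$ID1, ID2 = pairs$ID2, better_id = pairs$ID1,
      stringsAsFactors = FALSE
    ),
    failed_pairs = pairs[0, , drop = FALSE]
  )
}

final <- retry_failed_pairs(first_run, mock_submit)
```

Capping the number of retries avoids looping forever on pairs that fail deterministically, for example because a sample exceeds the model's context window.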
## Not run: 
# Requires TOGETHER_API_KEY and network access.
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 10, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# 1. Sequential execution with incremental saving
# If interrupted, running this again will resume progress.
res_seq <- submit_together_pairs_live(
  pairs = pairs,
  model = "deepseek-ai/DeepSeek-R1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  save_path = "results_seq.csv"
)

# 2. Parallel execution (faster)
# Note: On Windows, this opens background R sessions.
res_par <- submit_together_pairs_live(
  pairs = pairs,
  model = "deepseek-ai/DeepSeek-R1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  save_path = "results_par.csv",
  parallel = TRUE,
  workers = 4
)

# Inspect results
head(res_par$results)

# Check for failures
if (nrow(res_par$failed_pairs) > 0) {
  message("Some pairs failed:")
  print(res_par$failed_pairs)
}
## End(Not run)
Summarize an adaptive state.
summarize_adaptive(state)
state |
Adaptive state. |
Returns a compact run-level summary from canonical logs: attempted steps, committed comparisons, refit count, and last stop decision/reason. This is a pure view and does not recompute model quantities.
A one-row tibble with columns n_items,
steps_attempted, committed_pairs, n_refits,
last_stop_decision, and last_stop_reason.
adaptive_get_logs(), base::print()
Other adaptive ranking:
adaptive_rank(),
adaptive_rank_resume(),
adaptive_rank_run_live(),
adaptive_rank_start(),
make_adaptive_judge_llm()
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
summarize_adaptive(state)
This helper takes the object returned by fit_bt_model and
returns a tibble with one row per object (e.g., writing sample), including:
ID: object identifier
theta: estimated ability parameter
se: standard error of theta
rank: rank order of theta (1 = highest by default)
engine: modeling engine used ("sirt" or "BradleyTerry2")
reliability: MLE reliability (for sirt) or NA
summarize_bt_fit(fit, decreasing = TRUE, verbose = TRUE)
fit |
A list returned by |
decreasing |
Logical; should higher |
verbose |
Logical. If |
A tibble with columns:
Object identifier.
Estimated ability parameter.
Standard error of theta.
Rank of theta; 1 = highest
(if decreasing = TRUE).
Modeling engine used ("sirt" or "BradleyTerry2").
MLE reliability (numeric scalar) repeated on each row.
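The relationship between theta, the decreasing argument, and rank can be sketched in base R. This is an illustration with made-up theta values, not the package's code: `rank_theta()` is a hypothetical helper, and the tie-handling rule (`ties.method = "min"`) is an assumption.

```r
# Made-up ability estimates for three writing samples.
theta <- c(S01 = 0.8, S02 = -0.3, S03 = 1.4)

# Hypothetical helper: rank 1 goes to the highest theta when decreasing = TRUE,
# and to the lowest theta when decreasing = FALSE.
rank_theta <- function(theta, decreasing = TRUE) {
  if (decreasing) {
    rank(-theta, ties.method = "min")
  } else {
    rank(theta, ties.method = "min")
  }
}

rank_high_first <- rank_theta(theta)
rank_low_first <- rank_theta(theta, decreasing = FALSE)
```

With decreasing = TRUE, S03 (the largest theta) receives rank 1 and S02 (the smallest) receives rank 3, and the order flips for the extremes when decreasing = FALSE.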
# Example using built-in comparison data
data("example_writing_pairs")
bt <- build_bt_data(example_writing_pairs)

fit1 <- fit_bt_model(bt, engine = "sirt")
fit2 <- fit_bt_model(bt, engine = "BradleyTerry2")

summarize_bt_fit(fit1)
summarize_bt_fit(fit2)
Build an item-level diagnostics summary from the canonical item logs. This is a pure view and does not recompute posterior quantities or exposure metrics.
summarize_items(
  state,
  posterior = NULL,
  refit = NULL,
  bind = FALSE,
  top_n = NULL,
  sort_by = c("rank_mean", "theta_mean", "theta_sd", "degree", "pos_A_rate"),
  include_optional = TRUE
)
state |
An |
posterior |
Optional |
refit |
Optional refit index. When |
bind |
Logical; when |
top_n |
Optional positive integer; return only the top |
sort_by |
Column used for sorting. Defaults to |
include_optional |
Logical; include optional diagnostic columns. |
Rank percentiles are computed from the per-draw induced ranks (lower is
better). Rank uncertainty grows when draws disagree on the ordering. Degree
and position exposure metrics summarize how frequently each item was shown
and whether it appeared as the first option (A position). When
refit = NULL, the most recent refit is returned; when
refit = k, the k-th refit is returned. When bind = TRUE,
all refits are stacked into a single table and refit must be
NULL.
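To make "per-draw induced ranks" concrete, the sketch below ranks items within each posterior draw and then summarizes the resulting rank distribution. The draws are simulated here, and the percentile column names (`rank_p05`, `rank_p95`) are hypothetical rather than the canonical item log schema. `summarize_items()` itself only reads logged values and does not perform this computation.

```r
set.seed(1)

# Simulated posterior draws of theta: rows are draws, columns are items.
draws <- cbind(
  A = rnorm(200, mean = 1.0, sd = 0.3),
  B = rnorm(200, mean = 0.2, sd = 0.3),
  C = rnorm(200, mean = 0.0, sd = 0.3)
)

# Rank items within each draw; rank 1 = highest theta, so lower is better.
rank_draws <- t(apply(draws, 1, function(th) rank(-th)))

# Summarize each item's rank distribution across draws.
rank_summary <- data.frame(
  ID = colnames(draws),
  rank_mean = colMeans(rank_draws),
  rank_p05 = apply(rank_draws, 2, quantile, probs = 0.05),
  rank_p95 = apply(rank_draws, 2, quantile, probs = 0.95)
)

rank_summary
```

Items B and C have overlapping theta distributions, so their ranks swap between draws and their percentile intervals widen. That is the sense in which rank uncertainty grows when draws disagree on the ordering.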
A tibble with one row per item per refit. Columns reflect the
canonical item log schema (for example refit_id, ID,
theta_mean, rank_mean, deg, and posA_prop).
Rank percentiles summarize per-draw induced ranks (lower is better). When
include_optional = FALSE, optional columns such as repeated-pair or
adjacency diagnostics are dropped if present.
# summarize_items() expects an item_log_list (list of per-refit item tables).
# This example constructs a minimal logs object that matches what adaptive
# runs emit.
item_log_1 <- tibble::tibble(
  refit_id = 1L,
  ID = c("A", "B", "C"),
  theta_mean = c(0.4, 0.1, -0.2),
  theta_sd = c(0.2, 0.3, 0.25),
  rank_mean = c(1.2, 2.1, 2.7),
  degree = c(10L, 9L, 8L),
  pos_A_rate = c(0.55, 0.50, 0.48)
)
item_log_2 <- dplyr::mutate(
  item_log_1,
  refit_id = 2L,
  theta_mean = theta_mean + c(0.1, 0.05, 0.02),
  rank_mean = rank_mean + c(-0.1, 0.0, 0.1)
)
logs <- list(item_log_list = list(item_log_1, item_log_2))

# Default returns the most recent refit:
summarize_items(logs)

# Select a specific refit:
summarize_items(logs, refit = 1)

# Stack all refits into one table:
summarize_items(logs, bind = TRUE)

# Sort and take the top rows:
summarize_items(logs, sort_by = "rank_mean", top_n = 2)
Build a thin per-refit diagnostics summary from the adaptive round log. This is
a pure view over round_log and does not recompute posterior
quantities or stop metrics.
summarize_refits(state, last_n = NULL, include_optional = TRUE)
state |
An |
last_n |
Optional positive integer; return only the last |
include_optional |
Logical; include optional diagnostic columns. |
The round log is the canonical stop-audit trail. This summary is a direct
view over round_log with no recomputation.
Key fields include:
identity: refit_id, round_id_at_refit,
step_id_at_refit
run scale: total_pairs_done, new_pairs_since_last_refit,
n_unique_pairs_seen
candidate health: proposed_pairs_mode,
starve_rate_since_last_refit, fallback_rate_since_last_refit,
fallback_used_mode, starvation_reason_mode
identifiability/quota adaptation: global_identified,
global_identified_reliability_min,
global_identified_rank_corr_min, long_quota_raw,
long_quota_effective, long_quota_removed,
realloc_to_mid, realloc_to_local
diagnostics/stopping: diagnostics_pass, divergences,
max_rhat, min_ess_bulk, ess_bulk_required,
reliability_EAP, rho_theta, delta_sd_theta,
rho_rank, stop_decision, stop_reason
report-only uncertainty metrics: ci95_theta_width_*,
near_tie_adj_*, cov_trace_theta,
top20_boundary_entropy_*, nn_diff_sd_*
A tibble with one row per refit (canonical round_log schema).
# These summaries work on either an adaptive_state or a plain list of logs.
logs <- list(
  round_log = tibble::tibble(
    refit_id = 1:2,
    round_id_at_refit = c(1L, 2L),
    step_id_at_refit = c(10L, 20L),
    new_pairs_since_last_refit = c(50L, 50L),
    total_pairs_done = c(50L, 100L),
    divergences = c(0L, 0L),
    max_rhat = c(1.01, 1.00),
    min_ess_bulk = c(800, 900),
    stop_decision = c(NA, TRUE),
    stop_reason = c(NA_character_, "btl_converged")
  )
)

# Full per-refit view:
summarize_refits(logs)

# Only the most recent refit row:
summarize_refits(logs, last_n = 1)

# Drop optional diagnostics if you want a compact core summary:
summarize_refits(logs, include_optional = FALSE)
together_compare_pair_live() sends a single pairwise comparison prompt to
the Together.ai Chat Completions API (/v1/chat/completions) and parses the
result into a small tibble. It is the Together.ai analogue of
openai_compare_pair_live() and uses the same prompt template and tag
conventions (for example <BETTER_SAMPLE>...</BETTER_SAMPLE>).
together_compare_pair_live(
  ID1,
  text1,
  ID2,
  text2,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>",
  api_key = NULL,
  include_raw = FALSE,
  ...
)
ID1 |
Character ID for the first sample. |
text1 |
Character string containing the first sample's text. |
ID2 |
Character ID for the second sample. |
text2 |
Character string containing the second sample's text. |
model |
Together.ai model name (for example
|
trait_name |
Short label for the trait (for example "Overall Quality"). |
trait_description |
Full-text definition of the trait. |
prompt_template |
Prompt template string, typically from
|
tag_prefix |
Prefix for the better-sample tag. Defaults to
|
tag_suffix |
Suffix for the better-sample tag. Defaults to
|
api_key |
Optional Together.ai API key. If |
include_raw |
Logical; if |
... |
Additional Together.ai parameters, typically including
|
For models such as "deepseek-ai/DeepSeek-R1" that emit internal reasoning
wrapped in <think>...</think> tags, this helper will:
Extract the <think>...</think> block into the thoughts column.
Remove the <think>...</think> block from the visible content
column, so content contains only the user-facing answer.
Other Together.ai models (for example "moonshotai/Kimi-K2-Instruct-0905",
"Qwen/Qwen3-235B-A22B-Instruct-2507-tput",
"deepseek-ai/DeepSeek-V3") are supported via the same API but may not use
<think> tags; in those cases, thoughts will be NA and the full model
output will appear in content.
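The split between thoughts and content described above can be sketched in base R. This is an illustration only, not the package's internal code: split_think() is a hypothetical helper, and it assumes at most one <think>...</think> block per response.

```r
# Hypothetical helper (not part of pairwiseLLM): separate a single
# <think>...</think> block from the visible answer.
split_think <- function(raw) {
  # (?s) lets "." match newlines; ".*?" stops at the first closing tag.
  pattern <- "(?s)<think>(.*?)</think>"
  m <- regmatches(raw, regexec(pattern, raw, perl = TRUE))[[1]]
  if (length(m) == 0) {
    # No <think> block: thoughts is NA and the full output is the content.
    return(list(thoughts = NA_character_, content = trimws(raw)))
  }
  list(
    thoughts = trimws(m[2]),
    content = trimws(sub(pattern, "", raw, perl = TRUE))
  )
}

out <- split_think(
  "<think>A is clearer.</think>\n<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>"
)
out$thoughts
out$content
```

For a response without <think> tags, the same helper returns NA for thoughts, which matches the behavior described for the non-reasoning models.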
Temperature handling:
If temperature is not supplied in ..., the function applies
backend defaults:
"deepseek-ai/DeepSeek-R1" → temperature = 0.6.
All other models → temperature = 0.
If temperature is included in ..., that value is used and the
defaults are not applied.
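The default-resolution rule above can be written as a small function. This is a minimal sketch under the stated rules; resolve_temperature() is a hypothetical name and is not exported by the package.

```r
# Hypothetical helper (not part of pairwiseLLM): pick the temperature that
# would be sent, following the documented defaults.
resolve_temperature <- function(model, ...) {
  dots <- list(...)
  # An explicit temperature in ... always wins.
  if ("temperature" %in% names(dots)) {
    return(dots[["temperature"]])
  }
  # Otherwise fall back to the documented per-model defaults.
  if (identical(model, "deepseek-ai/DeepSeek-R1")) 0.6 else 0
}

resolve_temperature("deepseek-ai/DeepSeek-R1")
resolve_temperature("deepseek-ai/DeepSeek-V3")
resolve_temperature("deepseek-ai/DeepSeek-R1", temperature = 0.2)
```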
A tibble with one row and columns:
Stable ID for the pair (pair_uid if supplied via
...; otherwise "LIVE_<ID1>_vs_<ID2>").
The sample IDs you supplied.
Model name reported by the API.
API object type, typically "chat.completion".
HTTP-style status code (200 if successful).
Error message if something goes wrong; otherwise NA.
Internal reasoning text, for example <think>...</think>
blocks from models like "deepseek-ai/DeepSeek-R1".
Concatenated visible assistant output (without <think>
blocks).
"SAMPLE_1", "SAMPLE_2", or NA, based on the
<BETTER_SAMPLE> tag.
ID1 if "SAMPLE_1" is chosen, ID2 if "SAMPLE_2" is
chosen, otherwise NA.
Prompt / input token count (if reported).
Completion / output token count (if reported).
Total token count (if reported).
(Optional) list-column containing the parsed JSON body.
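The mapping from the <BETTER_SAMPLE> tag to a winning sample ID, as described for the return value, might look roughly like this. It is a sketch only: parse_better_sample() is a hypothetical helper, the element names of its result are illustrative, and the package's own parser may handle edge cases differently.

```r
# Hypothetical helper (not part of pairwiseLLM): read the label between the
# tag prefix and suffix and map it to one of the two supplied IDs.
parse_better_sample <- function(content, ID1, ID2,
                                tag_prefix = "<BETTER_SAMPLE>",
                                tag_suffix = "</BETTER_SAMPLE>") {
  none <- list(better_sample = NA_character_, better_id = NA_character_)
  if (is.na(content)) return(none)

  start <- as.integer(regexpr(tag_prefix, content, fixed = TRUE))
  if (start < 0) return(none)

  rest <- substring(content, start + nchar(tag_prefix))
  end <- as.integer(regexpr(tag_suffix, rest, fixed = TRUE))
  if (end < 0) return(none)

  label <- trimws(substring(rest, 1, end - 1))
  # Anything other than SAMPLE_1 or SAMPLE_2 is treated as unparseable.
  better_id <- switch(label, SAMPLE_1 = ID1, SAMPLE_2 = ID2, NA_character_)
  if (is.na(better_id)) return(none)

  list(better_sample = label, better_id = better_id)
}

res <- parse_better_sample(
  "Sample 2 is stronger. <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>",
  ID1 = "S01", ID2 = "S02"
)
res$better_sample
res$better_id
```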
## Not run:
# Requires TOGETHER_API_KEY set in your environment and network access.

data("example_writing_samples", package = "pairwiseLLM")

samples <- example_writing_samples[1:2, ]
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Example: DeepSeek-R1 with default temperature = 0.6 if not supplied
res_deepseek <- together_compare_pair_live(
  ID1 = samples$ID[1],
  text1 = samples$text[1],
  ID2 = samples$ID[2],
  text2 = samples$text[2],
  model = "deepseek-ai/DeepSeek-R1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

res_deepseek$better_id
res_deepseek$thoughts

# Example: Kimi-K2 with default temperature = 0 unless overridden
res_kimi <- together_compare_pair_live(
  ID1 = samples$ID[1],
  text1 = samples$text[1],
  ID2 = samples$ID[2],
  text2 = samples$text[2],
  model = "moonshotai/Kimi-K2-Instruct-0905",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

res_kimi$better_id

## End(Not run)
This helper returns both a short display name and a longer
description for a scoring trait. These can be inserted into
the prompt template via the {TRAIT_NAME} and
{TRAIT_DESCRIPTION} placeholders.
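To illustrate how the two placeholders are meant to be filled, the sketch below substitutes them into a short template string. The template text and the fill_trait() helper are made up for this example; the package's own templates and substitution code are not shown here.

```r
# Hypothetical helper (not part of pairwiseLLM): literal substitution of the
# two trait placeholders in a template string.
fill_trait <- function(template, trait_name, trait_description) {
  out <- gsub("{TRAIT_NAME}", trait_name, template, fixed = TRUE)
  gsub("{TRAIT_DESCRIPTION}", trait_description, out, fixed = TRUE)
}

# Illustrative template only; real templates come from set_prompt_template().
template <- "Compare the samples on {TRAIT_NAME}: {TRAIT_DESCRIPTION}"
filled <- fill_trait(
  template,
  trait_name = "Ideas",
  trait_description = "Quality and development of ideas in the writing."
)
filled
```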
trait_description(
  name = c("overall_quality", "organization"),
  custom_name = NULL,
  custom_description = NULL
)
name |
Character identifier for a built-in trait. One of
|
custom_name |
Optional short label to use when supplying a
|
custom_description |
Optional full-text definition of a
custom trait. When supplied, built-in |
A list with two elements:
Short display label for the trait (e.g., "Overall Quality").
Full-text definition of the trait, suitable for inclusion in the prompt.
td <- trait_description("overall_quality")
td$name
td$description

custom_td <- trait_description(
  custom_name = "Ideas",
  custom_description = "Quality and development of ideas in the writing."
)
custom_td$name
custom_td$description
Validate an adaptive session directory.
validate_session_dir(session_dir)
session_dir |
Directory containing session artifacts. |
Verifies that required session artifacts exist and that serialized logs match
canonical schemas for step_log and round_log. This check is
intended as a preflight for load_adaptive_session() and enforces the
canonical adaptive session metadata shape. Validation is strict:
added/removed/reordered columns in persisted logs are treated as schema
incompatibilities and abort resume.
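What "strict" means here can be shown with a small standalone check: column names must match the expected vector exactly, including order. This is an illustrative sketch, not the package's validator; check_log_schema() is hypothetical, and the column names below are invented rather than taken from the canonical step_log schema.

```r
# Hypothetical helper (not part of pairwiseLLM): abort unless the log's
# columns are exactly the expected ones, in the expected order.
check_log_schema <- function(log, expected_cols, log_name) {
  actual <- names(log)
  if (!identical(actual, expected_cols)) {
    stop(
      sprintf(
        "%s schema mismatch: expected [%s], found [%s]",
        log_name,
        paste(expected_cols, collapse = ", "),
        paste(actual, collapse = ", ")
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Invented column names, for illustration only.
expected <- c("step_id", "pair_id", "status")
ok_log <- data.frame(step_id = 1L, pair_id = "a_b", status = "ok")
reordered_log <- ok_log[, c("pair_id", "step_id", "status")]

check_log_schema(ok_log, expected, "step_log")

# A reordered log fails even though it holds the same columns.
msg <- tryCatch(
  check_log_schema(reordered_log, expected, "step_log"),
  error = function(e) conditionMessage(e)
)
msg
```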
A metadata list containing at least schema_version,
package_version, and n_items.
save_adaptive_session(), load_adaptive_session()
Other adaptive persistence:
load_adaptive_session(),
save_adaptive_session()
dir <- tempfile("pwllm-session-")
state <- adaptive_rank_start(c("a", "b", "c"), seed = 1)
save_adaptive_session(state, dir, overwrite = TRUE)
validate_session_dir(dir)
This helper takes the output of build_openai_batch_requests
(or a compatible table) and writes one JSON object per line, in the
format expected by the OpenAI batch API.
write_openai_batch_file(batch_tbl, path)
batch_tbl |
A data frame or tibble, typically the result of
|
path |
File path where the JSONL file should be written. |
The input can either:
Already contain a character column jsonl (one JSON string
per row), in which case that column is used directly, or
Contain the columns custom_id, method,
url, and body, in which case the JSON strings are
constructed automatically.
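The second case, building one JSON string per row from custom_id, method, url, and body, can be sketched with the jsonlite package. This is an assumption about one reasonable way to do it, not the package's actual implementation; rows_to_jsonl() is a hypothetical helper.

```r
library(jsonlite)

# Hypothetical helper (not part of pairwiseLLM): turn each row into a single
# JSON object string, as needed for a JSONL file.
rows_to_jsonl <- function(batch_tbl) {
  vapply(seq_len(nrow(batch_tbl)), function(i) {
    obj <- list(
      custom_id = batch_tbl$custom_id[[i]],
      method = batch_tbl$method[[i]],
      url = batch_tbl$url[[i]],
      body = batch_tbl$body[[i]]
    )
    # auto_unbox = TRUE writes length-one vectors as scalars, not arrays.
    as.character(toJSON(obj, auto_unbox = TRUE))
  }, character(1))
}

requests <- data.frame(
  custom_id = c("req1", "req2"),
  method = "POST",
  url = "/v1/chat/completions",
  stringsAsFactors = FALSE
)
# body is a list-column: one request body per row.
requests$body <- list(
  list(model = "gpt-4o-mini",
       messages = list(list(role = "user", content = "Hello"))),
  list(model = "gpt-4o-mini",
       messages = list(list(role = "user", content = "Goodbye")))
)

jsonl_lines <- rows_to_jsonl(requests)
path <- tempfile(fileext = ".jsonl")
writeLines(jsonl_lines, path)
readLines(path)
```

Each element of jsonl_lines is a complete JSON object on one line, so the file can be read back line by line and parsed independently.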
Invisibly returns path.
# Construct a minimal batch request tibble
requests <- tibble::tibble(
  custom_id = c("req1", "req2"),
  method = "POST",
  url = "/v1/chat/completions",
  body = list(
    list(
      model = "gpt-4o-mini",
      messages = list(
        list(role = "user", content = "Hello")
      )
    ),
    list(
      model = "gpt-4o-mini",
      messages = list(
        list(role = "user", content = "Goodbye")
      )
    )
  )
)

# Write to a temporary JSONL file
path <- tempfile(fileext = ".jsonl")
write_openai_batch_file(requests, path)

# Inspect the file contents
readLines(path)