---
title: "Getting Started with pairwiseLLM"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with pairwiseLLM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE
)
library(pairwiseLLM)
library(dplyr)
```

# 1. Introduction

`pairwiseLLM` provides a unified workflow for generating and analyzing **pairwise comparisons of writing quality** using LLM APIs (OpenAI, Anthropic, Gemini, Together) and local models via Ollama.

A typical workflow:

1. Select writing samples
2. Construct pairwise comparison sets
3. Submit comparisons to an LLM (live or batch API)
4. Parse model outputs
5. Fit Bradley–Terry or Elo models to obtain latent writing-quality scores

For prompt evaluation and positional-bias diagnostics, see:

* [`vignette("prompt-template-bias")`](https://shmercer.github.io/pairwiseLLM/articles/prompt-template-bias.html)

For advanced batch processing workflows, see:

* [`vignette("advanced-batch-workflows")`](https://shmercer.github.io/pairwiseLLM/articles/advanced-batch-workflows.html)

---

# 2. Setting API Keys

`pairwiseLLM` reads provider keys **only from environment variables**, never from R options or global variables.

| Provider | Environment Variable |
|----------|----------------------|
| [OpenAI](https://openai.com/api/) | OPENAI_API_KEY |
| [Anthropic](https://console.anthropic.com/) | ANTHROPIC_API_KEY |
| [Gemini](https://aistudio.google.com/) | GEMINI_API_KEY |
| [Together](https://www.together.ai/) | TOGETHER_API_KEY |

Put these in your `~/.Renviron`:

```
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="..."
GEMINI_API_KEY="..."
TOGETHER_API_KEY="..."
```

Check which keys are available:

```
library(pairwiseLLM)
check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#>   backend   service       env_var           has_key
#> 1 openai    OpenAI        OPENAI_API_KEY    TRUE
#> 2 anthropic Anthropic     ANTHROPIC_API_KEY TRUE
#> 3 gemini    Google Gemini GEMINI_API_KEY    TRUE
#> 4 together  Together.ai   TOGETHER_API_KEY  TRUE
```

[Ollama](https://ollama.com/) runs locally and does not require an API key; it only needs the Ollama server to be running.

---

# 3. Example Writing Data

The package ships with 20 simulated student writing samples with clear differences in quality:

```{r}
data("example_writing_samples", package = "pairwiseLLM")
dplyr::slice_head(example_writing_samples, n = 3)
```

Each sample has:

- `ID`
- `text`

---

# 4. Constructing Pairwise Comparisons

Create all unordered pairs:

```{r}
pairs <- example_writing_samples |>
  make_pairs()

dplyr::slice_head(pairs, n = 5)
```

Sample a subset of pairs:

```{r}
pairs_small <- sample_pairs(pairs, n_pairs = 10, seed = 123)
```

Randomize SAMPLE_1 / SAMPLE_2 order:

```{r}
pairs_small <- randomize_pair_order(pairs_small, seed = 99)
```

---

# 5. Traits and Prompt Templates

## 5.1 Using a built-in trait

```{r}
td <- trait_description("overall_quality")
td
```

Or define your own:

```{r}
td_custom <- trait_description(
  custom_name = "Clarity",
  custom_description = "How clearly and effectively ideas are expressed."
)
```

## 5.2 Using or customizing prompt templates

Load the default prompt:

```{r}
tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 300))
```

Placeholders required in custom prompt templates:

- `{TRAIT_NAME}`
- `{TRAIT_DESCRIPTION}`
- `{SAMPLE_1}`
- `{SAMPLE_2}`

Load a template from file:

```{r, eval=FALSE}
set_prompt_template(file = "my_template.txt")
```

---

# 6. Live Pairwise Comparisons

The unified wrapper works for **OpenAI, Anthropic, Gemini, Together, and Ollama.** It supports **parallel processing** and **incremental output file saving** (resume capability) for **all** supported backends.
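
To make the resume idea concrete, here is a minimal base-R sketch of how outcomes already saved to disk could be used to skip finished pairs. It is an illustration only, not the package's implementation: the data frames, the `pair_key()` helper, and the temporary file are hypothetical, and only the `ID1`/`ID2` column names are taken from this vignette.

```r
# Hypothetical illustration of "resume": NOT pairwiseLLM internals.

# Pairs we plan to score (ID1/ID2 mirror the column names used in this vignette).
planned_pairs <- data.frame(
  ID1 = c("S01", "S01", "S02", "S03"),
  ID2 = c("S02", "S03", "S03", "S04"),
  stringsAsFactors = FALSE
)

# Pretend an earlier, interrupted run already saved two outcomes to a CSV file.
demo_save_path <- tempfile(fileext = ".csv")
already_done <- data.frame(
  ID1 = c("S01", "S02"),
  ID2 = c("S02", "S03"),
  better_id = c("S02", "S02"),
  stringsAsFactors = FALSE
)
write.csv(already_done, demo_save_path, row.names = FALSE)

# Helper (assumed name): build one comparable key per ordered pair.
pair_key <- function(d) paste(d$ID1, d$ID2, sep = "||")

# On resume: read what is on disk and keep only pairs without a saved outcome.
saved <- read.csv(demo_save_path, stringsAsFactors = FALSE)
pending <- planned_pairs[!(pair_key(planned_pairs) %in% pair_key(saved)), , drop = FALSE]

# Rows left to submit: (S01, S03) and (S03, S04).
pending
```

When you use the wrapper itself, you do not need to write this logic; the `save_path` argument shown in the example for this section is how incremental saving is requested.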

`submit_llm_pairs()` returns a list containing:

- `$results`: observed outcomes only (canonical schema)
- `$failed_pairs`: scheduled pairs with no observed outcome
- `$failed_attempts`: attempt-level failures (retries, timeouts, parse errors, invalid winners)

```{r, eval=FALSE}
# Example using parallel processing and incremental saving
res_list <- submit_llm_pairs(
  pairs = pairs_small,
  backend = "openai", # also "anthropic", "gemini", "together"
  model = "gpt-4o",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  # Parallel processing and incremental saving:
  parallel = TRUE,
  workers = 4,
  save_path = "live_results.csv"
)
```

Preview results:

```{r, eval=FALSE}
# Successes are in the $results tibble
# (slice_head() requires `n` to be passed by name)
dplyr::slice_head(res_list$results, n = 5)

# Failures (if any) are in $failed_pairs
if (nrow(res_list$failed_pairs) > 0) {
  print(res_list$failed_pairs)
}

# Attempt-level failures (if any) are in $failed_attempts
if (nrow(res_list$failed_attempts) > 0) {
  print(res_list$failed_attempts)
}
```

Each row in `$results` includes:

- `custom_id` (uses `pair_uid` if supplied; otherwise defaults to `LIVE__vs_`)
- `ID1`, `ID2`
- parsed `` tag → `better_sample` and `better_id`
- canonical aliases/keys: `A_id`, `B_id`, `winner_pos`, `ordered_key`, `unordered_key`, `pair_uid`, `received_at`, `backend`, `model`
- `thoughts` (reasoning text, if available) and `content` (final answer)

---

# 7. Preparing Data for BT or Elo Modeling

Convert the LLM output (for `submit_llm_pairs()`, the `$results` tibble) to a three-column BT dataset:

```{r, eval=FALSE}
# res_list: output list from submit_llm_pairs()
# We extract the $results tibble for modeling
bt_data <- build_bt_data(res_list$results)
dplyr::slice_head(bt_data, n = 5)
```

and/or a dataset for Elo modeling:

```{r, eval=FALSE}
# res_list: output from submit_llm_pairs()
elo_data <- build_elo_data(res_list$results)
```

---

# 8. Bradley–Terry Modeling

Fit model:

```{r, eval=FALSE}
bt_fit <- fit_bt_model(bt_data)
```

Summarize results:

```{r, eval=FALSE}
summarize_bt_fit(bt_fit)
```

The output includes:

- latent θ ability scores
- SEs
- reliability (when using `sirt` engine)

---

# 9. Elo Modeling

```{r, eval=FALSE}
elo_fit <- fit_elo_model(elo_data, runs = 5)
elo_fit
```

Outputs:

- Elo ratings for each sample
- unweighted and weighted reliability
- trial counts

---

# 10. Batch APIs (Large Jobs)

## 10.1 Submit a batch

```{r, eval=FALSE}
batch <- llm_submit_pairs_batch(
  backend = "openai",
  model = "gpt-4o",
  pairs = pairs_small,
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)
```

## 10.2 Download results

```{r, eval=FALSE}
res_batch <- llm_download_batch_results(batch)
head(res_batch)
```

## 10.3 Multi-Batch Jobs

In addition to the standard batch helpers, you can split a large job into multiple segments using `llm_submit_pairs_multi_batch()` and then poll all of them with `llm_resume_multi_batches()`. This is particularly useful when you have many pairs or want to be able to resume if the session ends.

```{r, eval=FALSE}
# Generate a small set of pairs
pairs_small <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 10, seed = 4321) |>
  randomize_pair_order(seed = 8765)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Split into two batches and include reasoning/chain-of-thought
multi_job <- llm_submit_pairs_multi_batch(
  pairs = pairs_small,
  backend = "openai",
  model = "gpt-5.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  n_segments = 2,
  output_dir = "myjob",
  write_registry = TRUE,
  include_thoughts = TRUE
)

# Poll and merge results. Combined results are written to
# "myjob/combined_results.csv" or the directory you specify.
res <- llm_resume_multi_batches(
  jobs = multi_job$jobs,
  interval_seconds = 30,
  write_combined_csv = TRUE
)

head(res$combined)
```

## 10.4 Estimating cost before you run

For large jobs, it is often useful to estimate token usage and cost before launching a live run or submitting a batch. `pairwiseLLM` includes `estimate_llm_pairs_cost()`, which runs a small **pilot** (paid live calls) and then estimates the rest of the job by calibrating input tokens from prompt byte length.

The output includes both:

- **Expected cost** (using mean output tokens from the pilot)
- **Budget cost** (using a high quantile of pilot output tokens, controlled by `budget_quantile`)

If you are running a discounted batch workflow, set `mode = "batch"` and supply a `batch_discount` multiplier.

```{r, eval=FALSE}
# Create a moderate set of pairs
pairs_big <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 200, seed = 123) |>
  randomize_pair_order(seed = 456)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

est <- estimate_llm_pairs_cost(
  pairs = pairs_big,
  backend = "anthropic", # "openai", "anthropic", "gemini", "together"
  model = "claude-sonnet-4-5",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  mode = "batch",
  batch_discount = 0.5, # set to 1 for no discount
  n_test = 10, # paid pilot calls (live)
  budget_quantile = 0.9, # p90 output tokens
  cost_per_million_input = 3.0, # fill in your provider pricing
  cost_per_million_output = 15.0
)

est$summary
```

**Avoid paying twice: reuse pilot results**

The estimator returns both the pilot output and the pairs not included in the pilot (`remaining_pairs`).
Use `remaining_pairs` to submit only the remaining work once you are satisfied with the estimate:

```{r, eval=FALSE}
remaining_pairs <- est$remaining_pairs

# Example: submit only the remaining pairs as a batch
batch <- llm_submit_pairs_batch(
  backend = "anthropic",
  model = "claude-sonnet-4-5",
  pairs = remaining_pairs,
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

results <- llm_download_batch_results(batch)
```

Notes:

- The estimator does not require a provider tokenizer; it uses prompt byte length calibrated on the pilot.
- Ollama is not supported in the estimator (local models do not incur token costs).
- Reasoning/thinking tokens are treated as output tokens for pricing.

---

# 11. Backend-Specific Tools

Most users use the unified interface, but backend helpers are available.

### 11.1 OpenAI

- `submit_openai_pairs_live()`
- `build_openai_batch_requests()`
- `run_openai_batch_pipeline()`
- `parse_openai_batch_output()`

### 11.2 Anthropic

- `submit_anthropic_pairs_live()`
- `build_anthropic_batch_requests()`
- `run_anthropic_batch_pipeline()`
- `parse_anthropic_batch_output()`

### 11.3 Google Gemini

- `submit_gemini_pairs_live()`
- `build_gemini_batch_requests()`
- `run_gemini_batch_pipeline()`
- `parse_gemini_batch_output()`

### 11.4 Together.ai (live only)

- `together_compare_pair_live()`
- `submit_together_pairs_live()`

### 11.5 Ollama (local, live only)

- `ollama_compare_pair_live()`
- `submit_ollama_pairs_live()`

---

# 12. Troubleshooting

### Missing API keys

```{r}
check_llm_api_keys()
```

### Timeouts

Use the batch APIs for jobs of more than 40 pairs. Split a large job into multiple segments using `llm_submit_pairs_multi_batch()` and then poll/download all of them with `llm_resume_multi_batches()`.

### Positional bias

Use `compute_reverse_consistency()` + `check_positional_bias()` (see [`vignette("prompt-template-bias")`](https://shmercer.github.io/pairwiseLLM/articles/prompt-template-bias.html) for a full example).

---

# 13. Citation

> Mercer, S. H. (2025). *Getting started with pairwiseLLM* [R package vignette]. In *pairwiseLLM: Pairwise comparison tools for large language model-based writing evaluation*. https://doi.org/10.32614/CRAN.package.pairwiseLLM