GPT Batch Predictions
The BatchRunGPT class in cred-LLM-pipelines/batch_gpt.py provides a streamlined interface for running large-scale batch predictions through the OpenAI Batch API. It handles file chunking, job submission, status monitoring, result retrieval, and — importantly — cost estimation before committing to a run.
Repository: cred-LLM-pipelines
File: batch_gpt.py
Supported Models
| Model | Endpoint | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|---|
| `gpt-5-mini` | `/v1/responses` | $0.125 | $1.00 |
| `gpt-4o-mini` | `/v1/chat/completions` | $0.075 | $0.30 |
Note
These are the Batch API prices, which are lower than the real-time API prices. The costs are defined as class-level constants in BatchRunGPT.
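As a sketch of how these prices translate into dollar costs, the snippet below stores the table's Batch API rates in a dict and computes the cost of a call. The names `MODEL_COSTS` and `cost_usd` are illustrative assumptions, not the actual constants in `BatchRunGPT`.

```python
# Batch API prices from the table above, $ per 1M tokens (input, output).
# Names here are hypothetical; the real class-level constants may differ.
MODEL_COSTS = {
    "gpt-5-mini": (0.125, 1.00),
    "gpt-4o-mini": (0.075, 0.30),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a job at Batch API rates."""
    in_rate, out_rate = MODEL_COSTS[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# e.g. 1M input tokens + 100k output tokens on gpt-4o-mini:
print(cost_usd("gpt-4o-mini", 1_000_000, 100_000))  # 0.075 + 0.03 = 0.105
```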
Quick Start
```python
from batch_gpt import BatchRunGPT

batch_run = BatchRunGPT(
    data=df,               # DataFrame with an id column and a 'prompt_col' column
    system_prompt=prompt,  # system instructions for the model
    id_col="company_id",   # column used as unique identifier per row
    job_name="my_job",     # descriptive name, used for saving/loading batch IDs
)

batch_run.create_files()  # writes .jsonl batch files locally
batch_run.create_job()    # uploads files and submits batch jobs to OpenAI

# monitor progress (pass the batch index, e.g. 0)
batch_run.get_object_information(0)

# once status is 'completed'
results_df = batch_run.retrieve_results()
```
Input DataFrame Requirements
The DataFrame passed as `data` must contain:

- A column whose name matches `id_col`: a unique identifier per row (integer or string).
- A column named `prompt_col`: the user-facing text that will be sent to the model.
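A minimal compliant DataFrame looks like this (the id column is named `company_id` here, matching the `id_col` used in the examples; the prompt text is illustrative):

```python
import pandas as pd

# Minimal input DataFrame meeting the requirements above:
# a unique id column plus a 'prompt_col' column of model inputs.
df = pd.DataFrame({
    "company_id": [101, 102, 103],
    "prompt_col": [
        "Summarize the credit profile of company A.",
        "Summarize the credit profile of company B.",
        "Summarize the credit profile of company C.",
    ],
})

assert df["company_id"].is_unique  # ids must be unique per row
```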
Cost Estimation
Running batch jobs on hundreds of thousands of rows can get expensive. BatchRunGPT provides a way to estimate costs before submitting a job.
Sample-Based Estimation
Pass `estimate_cost=True` when constructing the class. This makes 30 real API calls on a random sample of your data, measures actual token usage, and extrapolates to the full dataset.
```python
batch_run = BatchRunGPT(
    data=df,
    system_prompt=prompt,
    id_col="company_id",
    job_name="my_job",
    gpt_model="gpt-5-mini",
    max_tokens=100,
    estimate_cost=True,  # triggers cost estimation
)
```
What happens under the hood:
- A random sample of 30 rows is drawn from the DataFrame.
- Each row is sent as a real (non-batch) API call to the selected model.
- The actual input and output token counts are recorded.
- Average tokens per row are multiplied by the total row count and the per-token cost for the selected model.
- The estimated input and output costs are printed:
```
Estimating job cost by running on sample to count tokens...
This job is estimated to cost $1.23 for input and $0.45 for output
```
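The extrapolation arithmetic can be sketched as follows. The sample token counts, row count, and variable names below are illustrative assumptions; only the formula (average sample tokens × total rows × per-token rate) follows the description above.

```python
# Assumed extrapolation step; variable names inside BatchRunGPT may differ.
sample_input_tokens = [420, 380, 455]   # measured on sample calls (illustrative)
sample_output_tokens = [95, 100, 88]

n_rows = 200_000                        # total rows in the DataFrame
in_rate, out_rate = 0.125, 1.00         # gpt-5-mini Batch API $ per 1M tokens

avg_in = sum(sample_input_tokens) / len(sample_input_tokens)
avg_out = sum(sample_output_tokens) / len(sample_output_tokens)

# Average tokens per row, scaled to the full dataset and priced per model.
est_input_cost = avg_in * n_rows / 1e6 * in_rate
est_output_cost = avg_out * n_rows / 1e6 * out_rate

print(f"This job is estimated to cost ${est_input_cost:.2f} for input "
      f"and ${est_output_cost:.2f} for output")
```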
How It Works Internally
Batch Lifecycle
For context, here is the full lifecycle of a batch job — cost estimation is the optional first step:
```mermaid
graph LR
    A["estimate_job_cost()"] -->|optional| B["create_files()"]
    B --> C["create_job()"]
    C --> D["get_object_information()"]
    D --> E{Status?}
    E -->|completed| F["retrieve_results()"]
    E -->|in_progress| D
    E -->|error| G["cancel_job()"]
```
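The monitor-and-retrieve portion of the lifecycle can be wrapped in a polling loop. This sketch assumes `get_object_information(batch_num)` returns the batch status as a string and that the terminal error statuses are those used by the OpenAI Batch API; check the source before relying on either assumption.

```python
import time

def wait_for_batch(batch_run, batch_num, poll_seconds=300):
    """Poll a batch until it completes, then download results.

    Assumes get_object_information() returns a status string; the
    actual return shape in batch_gpt.py may differ.
    """
    while True:
        status = batch_run.get_object_information(batch_num)
        if status == "completed":
            return batch_run.retrieve_results()
        if status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"Batch {batch_num} ended with status {status!r}")
        time.sleep(poll_seconds)  # batches can take hours; poll sparingly
```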
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data` | `DataFrame` | — | Input data with `id_col` and `prompt_col` columns |
| `system_prompt` | `str` | — | System instructions for the model |
| `id_col` | `str` | — | Name of the unique identifier column |
| `job_name` | `str` | — | Descriptive job name (used for saving batch IDs) |
| `gpt_model` | `str` | `"gpt-5-mini"` | Model to use (`gpt-5-mini` or `gpt-4o-mini`) |
| `max_tokens` | `int` | `100` | Max output tokens per response |
| `batch_files_dir` | `str` | `"batch_files"` | Directory for local .jsonl batch files |
| `file_name` | `str` | `None` | Pickle file to reload batch IDs from a previous run |
| `big_input` | `bool` | `False` | Reduces batch size from 50k to 15k rows per file for large prompts |
| `estimate_cost` | `bool` | `False` | Run sample-based cost estimation before job creation |
Key Methods
| Method | Description |
|---|---|
| `create_files()` | Builds .jsonl files split into chunks of up to 50k rows |
| `create_job()` | Uploads files to OpenAI and creates batch objects |
| `get_object_information(batch_num)` | Returns the status of a specific batch |
| `get_completion_time(batch_num)` | Prints how long a completed batch took |
| `retrieve_results(ignore_incomplete)` | Downloads and merges results into the original DataFrame |
| `cancel_job()` | Cancels all submitted batches |
| `get_running_batches()` | Lists batches still in progress |
| `estimate_job_cost()` | Runs sample-based cost estimation (called automatically if `estimate_cost=True`) |
Resuming a Previous Job
If you need to check on or retrieve results from a previously submitted job, pass the saved pickle file name:
```python
batch_run = BatchRunGPT(
    data=df,
    system_prompt=prompt,
    id_col="company_id",
    job_name="my_job",
    file_name="batch_ids_my_job.pkl",  # saved automatically during create_job()
)

# check status
batch_run.get_object_information(0)

# retrieve when done
results_df = batch_run.retrieve_results()
```