“Keep it simple” they said.
When I was working on an agentic pipeline for a question-answering task on a time-series dataset, the common advice was to just dump the raw data into the language model’s context. In my opinion, that was a flawed approach. Dumping a stream of floating-point numbers into a prompt is a recipe for confusing the model and getting unreliable results.
So, in this tutorial, we’ll explore a more robust alternative. We will fine-tune a Vision Language Model (VLM), specifically Google’s Gemma 3, for a Visual Question Answering (VQA) task on chart data. We’ll be using the powerful Hugging Face ecosystem, including transformers, datasets, and trl.
Install dependencies
First things first, let’s get the base libraries. We’ll be working in PyTorch for this guide; to install it:
!pip install -qq torch torchvision torchaudio
Next, we’ll need transformers for the model and processor, trl for the supervised fine-tuning loop, datasets for data handling, bitsandbytes for quantization, peft for parameter-efficient fine-tuning, and accelerate to handle device placement.
!pip install -U -qq transformers trl datasets bitsandbytes peft accelerate
- transformers: Provides the VLM architecture (Gemma 3) and processor.
- datasets: Allows us to easily load and manipulate the ChartQA dataset.
- bitsandbytes: Enables model quantization (like 4-bit loading) to reduce memory usage.
- peft: The Parameter-Efficient Fine-Tuning library, which contains the logic for LoRA/QLoRA.
- trl: The Transformer Reinforcement Learning library, which simplifies the supervised fine-tuning process with its SFTTrainer.
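If you want to confirm everything is wired up before going further, a quick (entirely optional) version check looks like this:

import torch, transformers, datasets, peft, trl, bitsandbytes

# Print the installed versions so any incompatibility surfaces now,
# not halfway through training
for lib in (torch, transformers, datasets, peft, trl, bitsandbytes):
    print(f"{lib.__name__}: {lib.__version__}")
print("CUDA available:", torch.cuda.is_available())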
Load and Prepare the Dataset
With our environment set up, it’s time to load our data. The ChartQA dataset is available on the Hugging Face Hub and is perfect for our task. It contains a wide variety of graphs along with corresponding question-answer pairs, presenting a solid challenge for a model’s VQA capabilities.
Since our goal is to build a conversational model, we need to format the dataset into a chat-like structure. I’ve crafted a system prompt to guide the model’s behavior:
="""You are a Vision Language Model that understands and
system_promptinterprets chart images. Your job is to look at the chart and answer
questions with short, clear responses—usually a single word, number,
or brief phrase. The charts may be line charts, bar charts, pie charts,
or others, and can include colors, labels, legends, and text. Focus on
giving accurate answers based only on what is shown in the image.
Do not explain your answer unless the question needs it to make sense.
"""
Referring to the chat template described in the Gemma 3 model card, we can create a formatting function that structures each sample into a conversation with system, user, and assistant roles.
def format_data(sample):
    # The system message sets the context for the model
    system_message = {
        "role": "system",
        "content": [{"type": "text", "text": system_prompt}],
    }
    # The user message provides the image and the question
    user_message = {
        "role": "user",
        "content": [
            {"type": "image", "image": sample["image"]},
            {"type": "text", "text": sample["query"]},
        ],
    }
    # The assistant message provides the ground-truth answer
    assistant_message = {
        "role": "assistant",
        "content": [{"type": "text", "text": sample["label"][0]}],
    }
    return [system_message, user_message, assistant_message]
This guide is compute-intensive, so we’ll only use a small fraction (10%) of the dataset for demonstration. For a production model, you would want to fine-tune on the full dataset.
from datasets import load_dataset
train_dataset, eval_dataset, test_dataset = load_dataset(
    "HuggingFaceM4/ChartQA",
    split=["train[:10%]", "val[:10%]", "test[:10%]"]
)
Now, let’s format the data using the chatbot structure.
train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]
test_dataset = [format_data(sample) for sample in test_dataset]
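It’s worth sanity-checking one converted sample before moving on. This little inspection (purely illustrative) just confirms the structure that format_data produces:

# Each formatted sample is a list of three messages: system, user, assistant
sample = train_dataset[0]
print([msg["role"] for msg in sample])      # ['system', 'user', 'assistant']
print(sample[1]["content"][1]["text"])      # the question about the chart
print(sample[2]["content"][0]["text"])      # the ground-truth answer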
Establish a Baseline
Before fine-tuning, let’s load the base model and test its performance out-of-the-box. This will give us a baseline to measure our improvements against.
We’ll load the gemma-3-4b-it model, which is the instruction-tuned version of Gemma 3 with 4 billion parameters.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-4b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
Next, let’s create a helper function to streamline inference. This function will take a data sample, process it, and generate the model’s response.
def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # The sample contains system, user, and assistant messages.
    # We only need the user message for inference.
    chat_for_inference = sample[1:2]

    # Prepare the text prompt by applying the chat template
    text_input = processor.apply_chat_template(
        chat_for_inference,
        add_generation_prompt=True,
        tokenize=False
    )

    # Prepare the image input
    image = sample[1]["content"][0]["image"]
    if image.mode != "RGB":
        image = image.convert("RGB")

    # Process both text and image
    inputs = processor(
        text=text_input,
        images=[image],
        return_tensors="pt",
    ).to(device)

    # Generate an answer
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens so the prompt isn't echoed back
    generated_ids = generated_ids[:, inputs["input_ids"].shape[-1]:]

    # Decode the generated tokens into text
    output_text = processor.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]
Let’s test the base model on a sample from our training set.
output = generate_text_from_sample(model, processor, train_dataset[0])
print(output)
The output will vary, but it’s often incorrect or generic.
A model with four billion parameters is still too large to fine-tune directly on most consumer hardware. To solve this, we’ll use QLoRA (Quantized Low-Rank Adaptation), a highly efficient fine-tuning technique.
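To see why, here is a rough back-of-the-envelope estimate (my assumptions: bf16 weights and gradients, fp32 Adam optimizer states, and activations ignored entirely):

# Rough memory estimate for FULL fine-tuning of a 4B-parameter model with AdamW
params = 4e9
bytes_weights   = params * 2      # bf16 weights (2 bytes each)
bytes_grads     = params * 2      # bf16 gradients
bytes_optimizer = params * 4 * 2  # fp32 Adam moments m and v (4 bytes each)
total_gb = (bytes_weights + bytes_grads + bytes_optimizer) / 1e9
print(f"~{total_gb:.0f} GB needed before activations")  # roughly 48 GB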
What is QLoRA?
QLoRA reduces the memory footprint of fine-tuning by combining two powerful ideas:
- Quantization: The main model weights are loaded in a 4-bit data type, drastically cutting down memory usage.
- Low-Rank Adaptation (LoRA): Instead of training all the model’s billions of parameters, we only train a small number of “adapter” matrices that are injected into the model’s architecture.
This combination allows us to fine-tune massive models on a single GPU without sacrificing much performance.
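To make the LoRA side of that concrete, here is a tiny worked example with illustrative layer sizes (not Gemma 3’s actual dimensions): a rank-8 adapter on a single 4096 x 4096 projection trains well under one percent of that matrix’s weights.

# LoRA replaces the weight update dW with B @ A, where B is (d_out x r)
# and A is (r x d_in); only A and B are trained, the original W stays frozen.
d_out, d_in, r = 4096, 4096, 8          # illustrative sizes only
full_params = d_out * d_in              # 16,777,216 frozen weights
lora_params = r * (d_out + d_in)        # 65,536 trainable adapter weights
print(f"trainable fraction: {lora_params / full_params:.4%}")  # ~0.39%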
Loading the Quantized Model
First, we’ll create a BitsAndBytesConfig to tell the transformers library to load our model in 4-bit precision.
from transformers import BitsAndBytesConfig
bits_and_bytes_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-4b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bits_and_bytes_config
)
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
BitsAndBytesConfig parameters:
- load_in_4bit=True: The master switch that enables 4-bit quantization.
- bnb_4bit_quant_type="nf4": Specifies the quantization type. “nf4” (NormalFloat 4-bit) is a sophisticated data type optimized for normally distributed weights, which are common in neural networks.
- bnb_4bit_use_double_quant=True: Applies a second quantization after the first one, further reducing the memory footprint.
- bnb_4bit_compute_dtype=torch.bfloat16: While the model weights are stored in 4-bit, computations (like matrix multiplications) are performed in a higher-precision format (bfloat16) to maintain accuracy and stability.
Setting Up the LoRA Configuration
Next, we define our LoraConfig. This tells PEFT where to inject the adapter layers and how to configure them.
from peft import LoraConfig, get_peft_model
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
)

peft_model = get_peft_model(model, peft_config)
LoraConfig parameters:
- r: The rank of the low-rank matrices. A smaller r means fewer trainable parameters and faster training, but might capture less information. 8 is a common starting point.
- lora_alpha: A scaling factor for the LoRA weights. It’s often set to twice the value of r.
- target_modules: A crucial parameter that specifies which layers of the base model will be adapted. For vision-language models, targeting the query (q_proj) and value (v_proj) projections in the attention mechanism is a common and effective strategy.
- task_type="CAUSAL_LM": Informs PEFT about the task type, ensuring the adapters are set up correctly for a causal language model.
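Since we already wrapped the model with get_peft_model above, we can confirm just how small the trainable footprint is (the exact counts will depend on your setup):

# Reports trainable adapter parameters vs. the total (mostly frozen) parameter count
peft_model.print_trainable_parameters()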
Configuring the Trainer
Now, we configure the training process using SFTConfig from the TRL library. This class holds all the hyperparameters for our supervised fine-tuning run.
from trl import SFTConfig
# Configure training arguments
training_args = SFTConfig(
    output_dir="gemma3-vqa-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,

    # Optimizer and scheduler settings
    optim="adamw_torch_fused",
    learning_rate=2e-4,
    lr_scheduler_type="constant",

    # Logging and evaluation
    logging_steps=10,
    eval_steps=10,
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=20,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    load_best_model_at_end=True,

    # Mixed precision and gradient settings
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,

    # Hub and reporting
    push_to_hub=True,
    report_to="wandb",

    # Gradient checkpointing settings
    gradient_checkpointing_kwargs={"use_reentrant": False},

    # Dataset configuration
    dataset_text_field="",  # Text field in dataset
    dataset_kwargs={"skip_prepare_dataset": True}
)
training_args.remove_unused_columns = False
SFTConfig parameters:
- output_dir: The directory where training checkpoints and the final adapter model will be saved.
- per_device_train_batch_size: The number of samples processed per GPU in each training step.
- gradient_accumulation_steps: A memory-saving technique. Gradients are accumulated for this many steps before an optimizer update is performed. This allows you to simulate a larger batch size (4 * 8 = 32 here) without using more GPU memory (see the quick check after this list).
- optim: The optimizer to use. adamw_torch_fused is a memory-efficient and fast version of the AdamW optimizer.
- eval_strategy="steps": Specifies that evaluation should be run at regular step intervals.
- bf16=True: Enables mixed-precision training, which speeds up computation and reduces memory usage by performing certain operations in bfloat16.
- gradient_checkpointing: Another key memory-saving technique that trades more compute time for a significantly smaller memory footprint during the backward pass.
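And here is the quick check promised above: a couple of lines to confirm the effective batch size and roughly how many optimizer steps one epoch will take (assuming a single GPU):

# Effective batch size = per-device batch * gradient accumulation steps (single GPU)
per_device_batch = 4
grad_accum = 8
effective_batch = per_device_batch * grad_accum          # 32 samples per optimizer step
steps_per_epoch = len(train_dataset) // effective_batch  # depends on the 10% split size
print("effective batch size:", effective_batch)
print("optimizer steps per epoch:", steps_per_epoch)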
Creating a Data Collator
We need one final piece before training: a data collator. This function takes a list of samples from our dataset and batches them together, ensuring they are correctly padded and formatted for the model. It’s also responsible for creating the labels for our language modeling task.
image_token_id = processor.tokenizer.additional_special_tokens_ids[
    processor.tokenizer.additional_special_tokens.index("<image>")
]

def collate_fn(examples):
    # Each 'example' is a list of dicts (system, user, assistant)
    texts = [processor.apply_chat_template(ex, tokenize=False) for ex in examples]
    images = [ex[1]["content"][0]["image"] for ex in examples]

    # Process the batch
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # Create labels for language modeling
    labels = batch["input_ids"].clone()

    # Mask padding tokens and image tokens so they are not included in the loss calculation
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == image_token_id] = -100
    batch["labels"] = labels

    return batch
A data collator is a function that takes a list of individual dataset items and bundles them into a single batch. Its key responsibilities are:
- Batching: Combining multiple samples into tensors.
- Padding: Making sure all sequences in the batch have the same length by adding padding tokens.
- Label Creation: For language modeling, it creates the labels tensor that the model uses to calculate loss. In our case, we mask out the padding tokens and image tokens so they don’t contribute to the loss. (Masking the user prompt as well, so the model only learns to predict the assistant’s response, is a common further refinement, but the collator above keeps things simple.)
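Before handing collate_fn to the trainer, a quick dry run on two samples (just a sanity check, nothing more) confirms the batching and masking behave as described:

# Build one small batch by hand and inspect it
test_batch = collate_fn([train_dataset[0], train_dataset[1]])
print(list(test_batch.keys()))            # typically input_ids, attention_mask, pixel_values, labels
print(test_batch["input_ids"].shape)      # (2, sequence_length)
masked = (test_batch["labels"] == -100).sum().item()
print(masked, "label positions excluded from the loss")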
Launching the Training
Now, we can instantiate the SFTTrainer from TRL. It elegantly wraps the entire training loop, handling everything from data collation to model saving.
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    processing_class=processor.tokenizer,
)
Why pass model and peft_config separately?
The SFTTrainer is smart. Instead of requiring you to wrap the model with get_peft_model yourself, it handles it internally. You provide the base (quantized) model and the peft_config, and the trainer sets up the PEFT model for you.
To start training, all we need to do is call one method:
trainer.train()
This will kick off the training process, and once training is complete, the best adapter checkpoint will be saved in the output_dir we specified.
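If you prefer to be explicit about it, you can also save (and, since push_to_hub=True, upload) the final adapter yourself; this assumes you are logged in to the Hugging Face Hub:

# Explicitly write the final adapter to training_args.output_dir and push it to the Hub
trainer.save_model()
trainer.push_to_hub()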
Test the Fine-Tuned Model
We’re done! Now for the moment of truth. Let’s load our fine-tuned adapter and see if the model’s performance has improved.
First, we reload the original 4-bit quantized model. Then, we use the load_adapter method to attach our trained LoRA weights.
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-4b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bits_and_bytes_config
)
model.load_adapter("gemma3-vqa-finetuned")
Let’s pick a sample from the test set that the model hasn’t seen before.
test_sample = test_dataset[21]
print("Question:", test_sample[1]["content"][1]["text"])
test_sample[1]["content"][0]["image"]
And now, let’s generate an answer with our fine-tuned model.
output = generate_text_from_sample(model, processor, test_sample)
print("Model Answer:", output)
You should now see a much more accurate and direct answer to the question, demonstrating the power of fine-tuning. We’ve successfully taught the model a new skill—interpreting charts—without the prohibitive cost of a full fine-tune.
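And if you want a number rather than a single anecdote, here is a rough sketch of an exact-match check over a small slice of the test set. Note that this is stricter than the official ChartQA metric (a relaxed accuracy that tolerates small numeric errors), so treat it as a lower bound rather than a benchmark score.

# Rough exact-match accuracy over the first 20 test samples (strict string comparison)
correct = 0
n_samples = 20
for sample in test_dataset[:n_samples]:
    prediction = generate_text_from_sample(model, processor, sample)
    answer = sample[2]["content"][0]["text"]    # the ground-truth label
    if prediction.strip().lower() == str(answer).strip().lower():
        correct += 1
print(f"exact match: {correct}/{n_samples}")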