All About Configs
This deep-dive will guide you through writing configs for running recipes.
What this deep-dive will cover:
- How to write a YAML config and run a recipe with it
- How to use the instantiate and parse APIs
- How to effectively use configs and CLI overrides for running recipes
Prerequisites:
- Be familiar with the overview of torchtune
- Make sure to install torchtune
- Understand the fundamentals of recipes
Where do parameters live?
There are two primary entry points for you to configure parameters: configs and
CLI overrides. Configs are YAML files that define all the
parameters needed to run a recipe within a single location. They are the single
source of truth for reproducing a run. The config parameters can be overridden on the
command-line using tune for quick changes and experimentation without
modifying the config.
Writing configs
Configs serve as the primary entry point for running recipes in torchtune. They are expected to be YAML files and they simply list out values for parameters you want to define for a particular run.
seed: null
shuffle: True
device: cuda
dtype: fp32
enable_fsdp: True
...
Configuring components using instantiate
Many fields will require specifying torchtune objects with associated keyword
arguments as parameters. Models, datasets, optimizers, and loss functions are
common examples of this. You can easily do this using the _component_
subfield. In _component_, you need to specify the dotpath of the object
you wish to instantiate in the recipe. The dotpath is the exact path you would use
to import the object normally in a Python file. For example, to specify the
alpaca_dataset in your config with custom
arguments:
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: False
Here, we are changing the default value for train_on_input from True
to False.
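To make the dotpath idea concrete, here is a minimal sketch of how a dotted path can be turned into a Python object using only the standard library. The helper name resolve_dotpath is hypothetical and is not torchtune's implementation; math.sqrt and collections.OrderedDict stand in for torchtune components so the snippet runs without torchtune installed.

```python
import importlib
from collections import OrderedDict
from typing import Any


def resolve_dotpath(dotpath: str) -> Any:
    """Import and return the object named by a dotted path (hypothetical helper)."""
    module_path, _, attr_name = dotpath.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.attribute', got {dotpath!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Resolving the dotpath gives the same object as a regular import statement
component = resolve_dotpath("collections.OrderedDict")
print(component is OrderedDict)  # True

# The resolved object can be called like any imported function
sqrt = resolve_dotpath("math.sqrt")
print(sqrt(9.0))  # 3.0
```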
Once you’ve specified the _component_ in your config, you can create an
instance of the specified object in your recipe’s setup like so:
from torchtune import config
# Access the dataset field and create the object instance
dataset = config.instantiate(cfg.dataset)
This will automatically use any keyword arguments specified in the fields under
dataset.
As written, the preceding example will actually throw an error. If you look at the signature of alpaca_dataset,
you’ll notice that we’re missing a required positional argument, the tokenizer.
Since this is another configurable torchtune object, let’s see how to handle
this by taking a look at the instantiate() API.
def instantiate(
    config: DictConfig,
    *args: Tuple[Any, ...],
    **kwargs: Dict[str, Any],
)
instantiate() also accepts positional arguments
and keyword arguments and automatically uses them with the config when creating
the object. This means we can not only pass in the tokenizer, but also add additional
keyword arguments not specified in the config if we’d like:
# Tokenizer is needed for the dataset, configure it first
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/tokenizer.model
dataset:
  _component_: torchtune.datasets.alpaca_dataset
# Note the API of the tokenizer we specified - we need to pass in a path
def llama2_tokenizer(path: str) -> Llama2Tokenizer:
# Note the API of the dataset we specified - we need to pass in a model tokenizer
# and any optional keyword arguments
def alpaca_dataset(
    tokenizer: ModelTokenizer,
    train_on_input: bool = True,
    max_seq_len: int = 512,
) -> InstructDataset:
from torchtune import config
# Since we've already specified the path in the config, we don't need to pass
# it in
tokenizer = config.instantiate(cfg.tokenizer)
# We pass in the instantiated tokenizer as the first required argument, then
# we change an optional keyword argument
dataset = config.instantiate(
    cfg.dataset,
    tokenizer,
    train_on_input=False,
)
Note that additional keyword arguments will overwrite any duplicated keys in the config.
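The argument-merging behavior described above can be sketched with the standard library alone. This is a simplified stand-in that works on a plain dict, not torchtune's actual instantiate(); the name instantiate_sketch is hypothetical, and datetime.timedelta is used in place of a torchtune component so the snippet runs on its own.

```python
import importlib
from typing import Any, Dict


def instantiate_sketch(cfg: Dict[str, Any], *args: Any, **kwargs: Any) -> Any:
    """Simplified, hypothetical stand-in for instantiate() on a plain dict."""
    module_path, _, name = cfg["_component_"].rpartition(".")
    component = getattr(importlib.import_module(module_path), name)
    # Every field other than _component_ becomes a keyword argument
    merged_kwargs = {k: v for k, v in cfg.items() if k != "_component_"}
    # Keyword arguments passed at call time win over duplicated config keys
    merged_kwargs.update(kwargs)
    return component(*args, **merged_kwargs)


# datetime.timedelta stands in for a torchtune component here
cfg = {"_component_": "datetime.timedelta", "hours": 1, "minutes": 30}

# 2 is passed positionally (timedelta's `days`); minutes=0 overrides the config
delta = instantiate_sketch(cfg, 2, minutes=0)
print(delta)  # 2 days, 1:00:00
```

The config dict itself is left unchanged, so the same config can be instantiated again with different call-time arguments.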
Referencing other config fields with interpolations
Sometimes you need to use the same value in multiple fields. You
can use interpolations to reference another field, and instantiate()
will automatically resolve it for you.
output_dir: /tmp/alpaca-llama2-finetune
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
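As a rough illustration of what resolving an interpolation means, the sketch below substitutes ${...} references in a plain dict using a regular expression. It is a deliberately simplified assumption of the behavior: the function name resolve_interpolations is hypothetical, every substituted value is converted to a string, and chained references are not followed.

```python
import re
from typing import Any, Dict

_INTERPOLATION = re.compile(r"\$\{([^}]+)\}")


def resolve_interpolations(node: Any, root: Dict[str, Any]) -> Any:
    """Return a copy of node with ${a.b} references looked up from root."""
    if isinstance(node, dict):
        return {k: resolve_interpolations(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_interpolations(v, root) for v in node]
    if isinstance(node, str):
        def lookup(match: re.Match) -> str:
            value: Any = root
            # Walk the dotted key from the top of the config
            for part in match.group(1).split("."):
                value = value[part]
            return str(value)
        return _INTERPOLATION.sub(lookup, node)
    return node


cfg = {
    "output_dir": "/tmp/alpaca-llama2-finetune",
    "metric_logger": {
        "_component_": "torchtune.utils.metric_logging.DiskLogger",
        "log_dir": "${output_dir}",
    },
}
resolved = resolve_interpolations(cfg, cfg)
print(resolved["metric_logger"]["log_dir"])  # /tmp/alpaca-llama2-finetune
```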
Validating your config
We provide a convenient CLI utility, tune validate, to quickly verify that
your config is well-formed and all components can be instantiated properly. You
can also pass in overrides if you want to test out the exact commands you will run
your experiments with. If any parameters are not well-formed, tune validate
will list out all the locations where an error was found.
tune cp llama2/7B_lora_single_device ./my_config.yaml
tune validate ./my_config.yaml
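One way such a check could work is to compare the keyword arguments in a config node against the component's signature without calling it. The sketch below is an assumption about the general approach, not a description of how tune validate is implemented; alpaca_dataset_stub and find_kwarg_errors are hypothetical names, and the stub only copies the parameters of the alpaca_dataset signature shown earlier.

```python
import inspect
from typing import Any, Callable, Dict, List


def alpaca_dataset_stub(
    tokenizer: Any, train_on_input: bool = True, max_seq_len: int = 512
) -> None:
    """Hypothetical stand-in with the same parameters as alpaca_dataset."""


def find_kwarg_errors(
    component: Callable[..., Any], kwargs: Dict[str, Any]
) -> List[str]:
    """Return one message if kwargs do not fit the signature, else an empty list."""
    try:
        # bind_partial tolerates missing required arguments (for example a
        # tokenizer supplied later in the recipe) but rejects unknown names
        inspect.signature(component).bind_partial(**kwargs)
    except TypeError as exc:
        return [f"{component.__name__}: {exc}"]
    return []


print(find_kwarg_errors(alpaca_dataset_stub, {"train_on_input": False}))  # []

# A misspelled field name is reported instead of being silently ignored
errors = find_kwarg_errors(alpaca_dataset_stub, {"train_on_inputs": False})
print(errors)
```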
Best practices for writing configs
Let’s discuss some guidelines for writing configs to get the most out of them.
Airtight configs
While it may be tempting to put as much as you can in the config to give you maximum flexibility in switching parameters for your experiments, we encourage you to only include fields in the config that will be used or instantiated in the recipe. This ensures full clarity on the options a recipe was run with and will make it significantly easier to debug.
# don't do this
alpaca_dataset:
  _component_: torchtune.datasets.alpaca_dataset
slimorca_dataset:
  ...
# do this
dataset:
  # change this in config or override when needed
  _component_: torchtune.datasets.alpaca_dataset
Use public APIs only
If a component you wish to specify in a config is located in a private file, use
the public dotpath in your config. These components are typically exposed in their
parent module’s __init__.py file. This way, you can guarantee the stability
of the API you are using in your config. No module or object name in your
component dotpath should start with an underscore.
# don't do this
dataset:
  _component_: torchtune.datasets._alpaca.alpaca_dataset
# do this
dataset:
  _component_: torchtune.datasets.alpaca_dataset
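A quick way to apply this guideline is to look for any part of the dotpath that begins with an underscore. The helper below is a hypothetical illustration and not part of torchtune; note that underscores inside a name, as in alpaca_dataset, are fine.

```python
from typing import List


def private_segments(dotpath: str) -> List[str]:
    """Return the parts of a dotpath that start with an underscore."""
    return [part for part in dotpath.split(".") if part.startswith("_")]


print(private_segments("torchtune.datasets._alpaca.alpaca_dataset"))  # ['_alpaca']
print(private_segments("torchtune.datasets.alpaca_dataset"))  # []
```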
Command-line overrides
Configs are the primary location to collect all your parameters to run a recipe,
but sometimes you may want to quickly try different values without having to update
the config itself. To enable quick experimentation, you can specify override values
for parameters in your config via the tune command. These should be specified
as key-value pairs: k1=v1 k2=v2 ...
For example, to run the lora_finetune_single_device recipe with custom model and tokenizer directories, you can provide overrides:
tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=/home/my_model_checkpoint \
checkpointer.checkpoint_files=['file_1','file_2'] \
tokenizer.path=/home/my_tokenizer_path
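Conceptually, each key=value pair addresses a nested config field by its dotted key. The sketch below shows one way such overrides could be merged into a plain dict; apply_overrides is a hypothetical name, the starting values in cfg are made up for illustration, and values are kept as strings rather than parsed into lists or numbers.

```python
import copy
from typing import Any, Dict, List


def apply_overrides(cfg: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of cfg with each 'dotted.key=value' override applied."""
    merged = copy.deepcopy(cfg)
    for item in overrides:
        key, _, value = item.partition("=")
        *parents, leaf = key.split(".")
        node = merged
        for part in parents:
            # Create intermediate levels if the config does not have them yet
            node = node.setdefault(part, {})
        node[leaf] = value  # values stay strings in this sketch
    return merged


# Illustrative starting values, not taken from a real torchtune config
cfg = {
    "checkpointer": {"checkpoint_dir": "/tmp/ckpt"},
    "tokenizer": {"path": "/tmp/tokenizer.model"},
}
merged = apply_overrides(
    cfg,
    [
        "checkpointer.checkpoint_dir=/home/my_model_checkpoint",
        "tokenizer.path=/home/my_tokenizer_path",
    ],
)
print(merged["tokenizer"]["path"])  # /home/my_tokenizer_path
print(cfg["tokenizer"]["path"])  # /tmp/tokenizer.model
```

Working on a copy keeps the original config untouched, which mirrors the idea that overrides are for quick experiments rather than permanent changes.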
Overriding components
If you would like to override a class or function in the config that is instantiated
via the _component_ field, you can do so by assigning to the parameter
name directly. Any nested fields in the components can be overridden with dot notation.
dataset:
  _component_: torchtune.datasets.alpaca_dataset
# Change to slimorca_dataset and set train_on_input to True
tune run lora_finetune_single_device --config my_config.yaml \
dataset=torchtune.datasets.slimorca_dataset dataset.train_on_input=True
Removing config fields
You may need to remove certain parameters from the config when changing components
through overrides that require different keyword arguments. You can do so by using
the ~ flag and specifying the dotpath of the config field you would like to remove.
For example, if you want to override a built-in config and use the bitsandbytes.optim.PagedAdamW8bit
optimizer, you may need to delete parameters like foreach which are
specific to PyTorch optimizers. Note that this example requires that you have bitsandbytes
installed.
# In configs/llama3/8B_full.yaml
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  foreach: False
# Change to PagedAdamW8bit and remove foreach
tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full \
optimizer=bitsandbytes.optim.PagedAdamW8bit ~optimizer.foreach
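The combined effect of a component override and a ~ removal can be sketched on a plain dict. This is an illustrative assumption about the semantics described in this section, not torchtune's own merging code; apply_component_overrides is a hypothetical name. The point it shows is that swapping the component leaves the other fields in place, which is why foreach has to be removed explicitly.

```python
import copy
from typing import Any, Dict, List


def apply_component_overrides(
    cfg: Dict[str, Any], overrides: List[str]
) -> Dict[str, Any]:
    """Apply 'key=value' and '~key' overrides to a copy of cfg (sketch only)."""
    merged = copy.deepcopy(cfg)
    for item in overrides:
        remove = item.startswith("~")
        key, _, value = item.lstrip("~").partition("=")
        *parents, leaf = key.split(".")
        node = merged
        for part in parents:
            node = node[part]
        if remove:
            # ~a.b deletes field b under a
            del node[leaf]
        elif isinstance(node.get(leaf), dict) and "_component_" in node[leaf]:
            # Assigning to a component field swaps only its _component_
            node[leaf]["_component_"] = value
        else:
            node[leaf] = value
    return merged


cfg = {
    "optimizer": {"_component_": "torch.optim.AdamW", "lr": 2e-5, "foreach": False}
}
merged = apply_component_overrides(
    cfg, ["optimizer=bitsandbytes.optim.PagedAdamW8bit", "~optimizer.foreach"]
)
print(merged)
# {'optimizer': {'_component_': 'bitsandbytes.optim.PagedAdamW8bit', 'lr': 2e-05}}
```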