HuggingFace backend

actantial.backends.huggingface.HuggingFaceBackend

Bases: LLMBackend

Backend for locally loaded HuggingFace models.

Loads the model and tokenizer from the HuggingFace Hub at initialisation. Quantisation via bitsandbytes (4-bit) is supported, but requires a CUDA GPU.
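
As a rough illustration of what loading involves, the sketch below shows how the constructor arguments could be turned into arguments for `AutoModelForCausalLM.from_pretrained`. It assumes that `repository` and `model_name` are joined into a Hub ID of the form `repository/model_name`, and that 4-bit loading goes through `transformers.BitsAndBytesConfig`. The helper name `build_from_pretrained_args` is hypothetical and not part of actantial; the backend's actual internals may differ.

```python
from typing import Any


def build_from_pretrained_args(
    repository: str,
    model_name: str,
    quantisation: bool = False,
    torch_dtype: str = "auto",
    **kwargs: Any,
) -> tuple[str, dict[str, Any]]:
    """Hypothetical helper: map constructor arguments to a Hub ID and load kwargs."""
    # Assumption: the Hub ID is "<repository>/<model_name>".
    repo_id = f"{repository}/{model_name}"

    load_kwargs: dict[str, Any] = {"torch_dtype": torch_dtype, **kwargs}

    if quantisation:
        # Imported lazily so the non-quantised path does not import transformers here.
        from transformers import BitsAndBytesConfig

        # Assumption: 4-bit loading is requested via a quantization_config object.
        load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)

    return repo_id, load_kwargs


repo_id, load_kwargs = build_from_pretrained_args(
    "deepseek-ai", "DeepSeek-R1-Distill-Qwen-32B"
)
print(repo_id)
print(load_kwargs)
```

Under these assumptions, the returned pair would be used as `AutoModelForCausalLM.from_pretrained(repo_id, **load_kwargs)`.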

__init__(repository, model_name, quantisation=False, torch_dtype='auto', temperature=None, do_sample=False, top_p=None, top_k=None, **kwargs)

Load the model and tokenizer from the HuggingFace Hub.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `repository` | `str` | HuggingFace repository name (e.g., deepseek-ai). | required |
| `model_name` | `str` | Model identifier within the repository (e.g., DeepSeek-R1-Distill-Qwen-32B). | required |
| `quantisation` | `bool` | If True, load the model in 4-bit precision using bitsandbytes. Requires a CUDA GPU. | `False` |
| `torch_dtype` | `str` | Floating-point precision passed to from_pretrained. Accepts "auto" (default), "float16", or "bfloat16". | `'auto'` |
| `temperature` | `Optional[float]` | Sampling temperature; higher values increase randomness. | `None` |
| `do_sample` | `bool` | If True, use sampling; defaults to False for deterministic (greedy) output. | `False` |
| `top_p` | `Optional[float]` | Nucleus sampling probability threshold. | `None` |
| `top_k` | `Optional[int]` | Top-k sampling parameter. | `None` |
| `**kwargs` | `Any` | Additional arguments passed to AutoModelForCausalLM.from_pretrained. | `{}` |
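
The sampling parameters all default to "unset". One plausible way to handle that, sketched below, is to forward only the options that were actually given, so that `None` falls back to the model's own generation defaults. This is an assumption about the behaviour, not a description of the backend's code, and `build_generation_kwargs` is a hypothetical name. Note that in transformers, `temperature`, `top_p` and `top_k` only influence decoding when `do_sample` is True.

```python
from typing import Any, Optional


def build_generation_kwargs(
    temperature: Optional[float] = None,
    do_sample: bool = False,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: keep only the sampling options that were set."""
    options = {"temperature": temperature, "top_p": top_p, "top_k": top_k}
    gen_kwargs: dict[str, Any] = {"do_sample": do_sample}
    # Assumption: None means "use the model's own generation defaults".
    gen_kwargs.update({k: v for k, v in options.items() if v is not None})
    return gen_kwargs


greedy = build_generation_kwargs()
sampled = build_generation_kwargs(temperature=0.7, do_sample=True, top_p=0.9)
print(greedy)   # {'do_sample': False}
print(sampled)  # {'do_sample': True, 'temperature': 0.7, 'top_p': 0.9}
```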

generate(prompt, max_new_tokens=2048, **kwargs)

Generate text from a prompt.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str` | The input prompt string. | required |
| `max_new_tokens` | `int` | Maximum number of tokens to generate. | `2048` |
| `**kwargs` | `Any` | Additional parameters passed to the model's generate method. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `str` | The generated text string, excluding the input prompt. |
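
Decoder-only models in transformers return the prompt tokens followed by the newly generated tokens, so excluding the prompt usually means dropping the first `prompt_length` token IDs before decoding. The sketch below shows that step on plain lists. Whether the backend strips the prompt at token level or at string level is not stated here, and `strip_prompt_tokens` is a hypothetical helper.

```python
from typing import Sequence


def strip_prompt_tokens(output_ids: Sequence[int], prompt_length: int) -> list[int]:
    """Hypothetical helper: drop the echoed prompt from a generated sequence."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the length of the output")
    return list(output_ids[prompt_length:])


# Toy token IDs: the first three stand in for the prompt, the rest are new.
output_ids = [101, 7592, 2088, 2003, 2307]
new_ids = strip_prompt_tokens(output_ids, prompt_length=3)
print(new_ids)  # [2003, 2307]
```

With a real tokenizer, the remaining IDs would then be passed to `tokenizer.decode(new_ids, skip_special_tokens=True)`.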

cleanup()

Unload the model and free GPU memory.
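
Because the model stays in memory until `cleanup()` is called, it can help to pair every generation with a guaranteed cleanup. The sketch below wraps the two documented methods in `try`/`finally`. The names `generate_once`, `SupportsGenerate` and `EchoBackend` are hypothetical; `EchoBackend` is only a stand-in so the example runs without downloading a model.

```python
from typing import Protocol


class SupportsGenerate(Protocol):
    """Structural type for the two methods documented on this page."""

    def generate(self, prompt: str, max_new_tokens: int = 2048) -> str: ...
    def cleanup(self) -> None: ...


def generate_once(
    backend: SupportsGenerate, prompt: str, max_new_tokens: int = 256
) -> str:
    """Hypothetical helper: run one generation and always release the model."""
    try:
        return backend.generate(prompt, max_new_tokens=max_new_tokens)
    finally:
        # Runs even if generate() raises, so the model is not left loaded.
        backend.cleanup()


class EchoBackend:
    """Stand-in backend used here instead of a real model."""

    def __init__(self) -> None:
        self.cleaned_up = False

    def generate(self, prompt: str, max_new_tokens: int = 2048) -> str:
        return prompt.upper()

    def cleanup(self) -> None:
        self.cleaned_up = True


demo = EchoBackend()
result = generate_once(demo, "hello")
print(result, demo.cleaned_up)  # HELLO True
```

With the real backend, the same helper would be called with an instance such as `HuggingFaceBackend(repository="deepseek-ai", model_name="DeepSeek-R1-Distill-Qwen-32B")`, imported from `actantial.backends.huggingface`.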