Getting Started with ALLaM-7B-Instruct-preview


ALLaM-7B-Instruct-preview is a powerful 7-billion parameter large language model developed by the National Center for Artificial Intelligence (NCAI) at the Saudi Data and AI Authority (SDAIA). It's specifically trained for both Arabic and English, making it a valuable tool for bilingual applications. This tutorial guides you through setting up and using the model directly in Python and explains how to interact with it from JavaScript via a hosted API endpoint.

Introduction to ALLaM

The model is part of the ALLaM series, designed to advance Arabic Language Technology (ALT). This specific version (ALLaM-7B-Instruct-preview) is instruction-tuned, meaning it's optimized to follow user instructions provided in prompts. It's built using an autoregressive transformer architecture and supports a context length of 4096 tokens.

Python Usage with transformers

The primary way to interact with ALLaM is through the Hugging Face transformers library in Python.

Setup

  1. Install Libraries: You'll need transformers (version 4.40.1 or newer) and torch. A CUDA-enabled GPU is highly recommended for reasonable performance.

    pip install transformers torch
    # Or for CUDA support (ensure your PyTorch version matches your CUDA version):
    # pip install transformers torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # Example for CUDA 11.8
  2. Hugging Face Account (Optional): Depending on model access permissions, you might need to be logged into your Hugging Face account. You can log in via the CLI:

    huggingface-cli login
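
If you prefer to authenticate from Python rather than the CLI, the huggingface_hub library (installed alongside transformers) provides a login helper. The token below is a placeholder; substitute your own access token:

    # Programmatic alternative to `huggingface-cli login`
    from huggingface_hub import login
    login(token="hf_xxxxxxxxxxxxxxxx")  # placeholder token, replace with your own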

Example Code

The following script loads the model and tokenizer, prepares an input prompt (in Arabic), generates a response, and prints it.

# -*- coding: utf-8 -*-
"""
Example usage for the ALLaM-AI/ALLaM-7B-Instruct-preview model from Hugging Face.
 
This script demonstrates how to load the model and tokenizer using the
transformers library and generate text based on a sample prompt.
 
Requirements:
- transformers>=4.40.1
- torch
- A CUDA-enabled GPU is highly recommended for performance.
"""
 
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
 
# --- Configuration ---
MODEL_NAME = "ALLaM-AI/ALLaM-7B-Instruct-preview"
# Set device to CUDA if available, otherwise CPU (will be very slow on CPU)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")
 
# --- Load Model and Tokenizer ---
try:
    print(f"Loading model: {MODEL_NAME}...")
    # Consider adding torch_dtype=torch.bfloat16 if memory is an issue and GPU supports it
    allam_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    print(f"Loading tokenizer: {MODEL_NAME}...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    print("Model and tokenizer loaded successfully.")
 
    # Move model to the selected device
    allam_model = allam_model.to(DEVICE)
 
except Exception as e:
    print(f"Error loading model or tokenizer: {e}")
    print("Please ensure you have the necessary libraries installed and are logged in to Hugging Face if required.")
    exit(1)
 
# --- Prepare Input ---
# Example prompt (Arabic)
messages = [
    {"role": "user", "content": "كيف أجهز كوب شاهي؟"}, # "How do I prepare a cup of tea?"
]
 
# Apply the chat template (handles formatting for the model)
# Note: per the model card, ALLaM works without a system prompt by default,
# but you can optionally prepend one, for example:
# messages = [
#     {"role": "system", "content": "You are ALLaM, a bilingual English and Arabic AI assistant."},
#     {"role": "user", "content": "كيف أجهز كوب شاهي؟"},
# ]
try:
    print("Applying chat template...")
    # tokenize=False first to get the formatted string, then tokenize
    formatted_input_string = tokenizer.apply_chat_template(messages, tokenize=False)
    print(f"Formatted input string:\n{formatted_input_string}")
 
    print("Tokenizing input...")
    inputs = tokenizer(formatted_input_string, return_tensors='pt', return_token_type_ids=False)
 
    # Move inputs to the selected device
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    print("Input prepared for the model.")
 
except Exception as e:
    print(f"Error preparing input: {e}")
    exit(1)
 
# --- Generate Response ---
print("Generating response...")
try:
    # Generation parameters (adjust as needed)
    generation_params = {
        "max_new_tokens": 4096, # Max tokens for the *newly generated* text
        "do_sample": True,     # Use sampling for more creative output
        "top_k": 50,           # Consider only the top K most likely tokens
        "top_p": 0.95,         # Use nucleus sampling (cumulative probability)
        "temperature": 0.6     # Controls randomness (lower = more deterministic)
    }
    print(f"Generation parameters: {generation_params}")
 
    with torch.no_grad(): # Disable gradient calculations for inference
        response_ids = allam_model.generate(**inputs, **generation_params)
 
    print("Decoding response...")
    # Decode the generated token IDs back to text
    # response_ids[0] accesses the first sequence in the batch
    # skip_special_tokens=True removes tokens like <bos>, <eos>
    decoded_response = tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0]
 
    # Often the decoded response includes the input prompt, we might want to remove it.
    # This simple approach assumes the response starts exactly with the formatted input.
    # More robust methods might be needed depending on the model's exact output format.
    if decoded_response.startswith(formatted_input_string):
         final_output = decoded_response[len(formatted_input_string):].strip()
    else:
         # Fallback if the prompt isn't exactly at the start (might happen with some templates/models)
         # This might still include parts of the prompt depending on the tokenizer/template behavior.
         # A more robust way might involve finding the specific turn separator used by the template.
         print("Warning: Could not cleanly separate prompt from response. Displaying full decoded output.")
         final_output = decoded_response # Show the full thing if separation fails
 
    print("\n--- Generated Response ---")
    print(final_output)
    print("--------------------------\n")
 
except Exception as e:
    print(f"Error during generation or decoding: {e}")
    exit(1)
 
print("Script finished successfully.")

Running the Python Example

Save the code above as allam_example.py and run it from your terminal:

python allam_example.py

The script will load the model (this might take some time and download files on the first run), process the input, generate the text, and print the result.
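
If the model does not fit in your GPU memory at full precision, a common variant (assuming a bfloat16-capable GPU and the accelerate package, installed via pip install accelerate) is to load the weights in half precision and let accelerate handle device placement:

# Optional lower-memory loading (assumes a bfloat16-capable GPU and `accelerate`)
# bfloat16 roughly halves the memory footprint versus full precision;
# device_map="auto" places the weights automatically, so skip the manual .to(DEVICE).
allam_model = AutoModelForCausalLM.from_pretrained(
    "ALLaM-AI/ALLaM-7B-Instruct-preview",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)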

JavaScript Usage (via Hosted API)

Running a 7-billion parameter model like ALLaM directly within a standard JavaScript environment (like a web browser or Node.js) using libraries like @xenova/transformers is generally not feasible due to the model's large size and high resource requirements (RAM, VRAM, CPU/GPU power).

The practical way to interact with such a model from JavaScript is by calling an API endpoint where the model is hosted on suitable backend infrastructure. Platforms like Hugging Face Spaces or dedicated Inference Endpoints allow you to deploy the model and expose it via an API.

Example JavaScript Client Code (Calling a Hosted API)

This code demonstrates how to use fetch to send a prompt to a hypothetical API endpoint hosted on Hugging Face Spaces. You would need to deploy the model first and replace the placeholder URL.

/**
 * Example JavaScript usage for interacting with the ALLaM model hosted on Hugging Face Spaces.
 *
 * IMPORTANT: This script assumes you have deployed the ALLaM model within a
 * backend application (e.g., using Python FastAPI/Gradio and the `transformers` library)
 * on Hugging Face Spaces. You need to replace the placeholder URL below
 * with the actual public URL of your deployed Space's API endpoint.
 */
 
// !!! REPLACE THIS WITH YOUR ACTUAL HUGGING FACE SPACE API ENDPOINT URL !!!
// It might look something like: https://your-username-your-space-name.hf.space/generate
const HF_SPACE_API_URL = "https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space/generate"; // Placeholder URL
 
// Example prompt
const userPrompt = "كيف أجهز كوب شاهي؟";
 
async function generateTextViaHFSpace(promptText) {
  console.log(`Sending prompt to HF Space: "${promptText}" at ${HF_SPACE_API_URL}`);
 
  if (HF_SPACE_API_URL.includes("YOUR_USERNAME-YOUR_SPACE_NAME")) {
    console.error("Error: Please replace the placeholder HF_SPACE_API_URL with your actual Space endpoint URL.");
    return null;
  }
 
  try {
    const response = await fetch(HF_SPACE_API_URL, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // "Authorization": "Bearer YOUR_HF_TOKEN_IF_NEEDED", // Add if your Space is private
      },
      body: JSON.stringify({ prompt: promptText }), // Adjust payload as needed by your API
    });
 
    if (!response.ok) {
      let errorDetails = `HTTP error! status: ${response.status}`;
      try { const errorBody = await response.json(); errorDetails += `, Body: ${JSON.stringify(errorBody)}`; }
      catch (e) { const textBody = await response.text(); errorDetails += `, Body: ${textBody}`; }
      throw new Error(errorDetails);
    }
 
    const data = await response.json();
    console.log("Received response from HF Space.");
    // Adjust the key based on what your Space API returns
    return data.generated_text || data.response || data;
 
  } catch (error) {
    console.error("Error calling Hugging Face Space API:", error);
    throw error;
  }
}
 
// Execute
(async () => {
  try {
    const generatedText = await generateTextViaHFSpace(userPrompt);
    if (generatedText !== null) {
        console.log("\n--- Generated Response (from HF Space) ---");
        console.log(generatedText);
        console.log("------------------------------------------\n");
    }
  } catch (error) {
    console.error("Failed to get generation from Hugging Face Space.");
  }
})();
 
// To run this (assuming Node.js):
// 1. Deploy ALLaM in a Hugging Face Space with an API endpoint.
// 2. Update the HF_SPACE_API_URL constant in this script.
// 3. Run `node allam_hf_space_example.js`.

Requirement: This JavaScript approach requires you to first host the ALLaM model behind an API. See the next section for conceptual steps using Hugging Face Spaces.

Hosting ALLaM for API Access (Hugging Face Spaces)

Hugging Face Spaces provide a platform to host ML applications, including serving models via APIs. Here's a conceptual overview of deploying ALLaM to a Space:

  1. Create a New Space: Go to Hugging Face and create a new Space, choosing a suitable SDK (e.g., Gradio, or Docker for a custom FastAPI app). You'll likely need to select paid GPU hardware (e.g., an A10G instance) to run a 7B model effectively.
  2. Define Dependencies: Create a requirements.txt file listing necessary Python libraries:
    transformers>=4.40.1
    torch
    fastapi
    uvicorn
    accelerate # Often needed for efficient model loading
    # Add other libraries as needed
  3. Create Backend App (e.g., app.py with FastAPI; a minimal sketch is shown after this list):
    • Import necessary libraries (FastAPI, transformers, torch, etc.).
    • Load the ALLaM model and tokenizer on startup (using AutoModelForCausalLM.from_pretrained, similar to the Python example). Ensure it's loaded onto the correct device (GPU if available in the Space).
    • Define a Pydantic model for the request body (e.g., containing a prompt field).
    • Create a FastAPI POST endpoint (e.g., /generate).
    • Inside the endpoint function:
      • Receive the prompt from the request body.
      • Prepare the input using tokenizer.apply_chat_template.
      • Generate the response using model.generate().
      • Decode the response using tokenizer.batch_decode.
      • Return the generated text in a JSON response.
  4. Configure Space: Ensure your Space configuration uses the correct Python file (app.py) and installs the dependencies from requirements.txt.
  5. Deploy: Commit your files (app.py, requirements.txt, etc.) to the Space repository. Hugging Face will build and deploy the application.
  6. Get API URL: Once deployed, your Space will have a public URL. The API endpoint will be at https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space/YOUR_ENDPOINT (e.g., /generate). Use this URL in your JavaScript client code.
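
To make step 3 concrete, here is a minimal, illustrative app.py sketch. The /generate path, the prompt request field, and the generated_text response key are assumptions chosen to match the example JavaScript client above; adjust them freely as long as client and server agree.

# Minimal FastAPI sketch for serving ALLaM from a Space (illustrative, not production-ready).
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "ALLaM-AI/ALLaM-7B-Instruct-preview"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

app = FastAPI()

# Load the model and tokenizer once at startup (a GPU instance is strongly recommended).
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(DEVICE)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 512  # assumed default; tune for your use case

@app.post("/generate")
def generate(req: GenerateRequest):
    # Format the user prompt with the model's chat template.
    messages = [{"role": "user", "content": req.prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt", return_token_type_ids=False).to(DEVICE)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            do_sample=True, top_k=50, top_p=0.95, temperature=0.6,
        )

    # Return only the newly generated portion of the sequence.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return {"generated_text": tokenizer.decode(new_tokens, skip_special_tokens=True)}

In a Docker-based Space you would typically start this app with uvicorn (for example, uvicorn app:app --host 0.0.0.0 --port 7860, since Spaces expose port 7860 by default).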

Note: This is a high-level overview. Building and deploying robust applications on HF Spaces involves details like error handling, resource management, and potentially authentication. Refer to the Hugging Face Spaces Documentation for detailed guides.

System Prompts

The ALLaM model is optimized to work without a system prompt by default. However, you can provide one if needed by adding a message with role: "system" to the messages list before your user prompt in the Python code.

Examples:

  • English: {"role": "system", "content": "You are ALLaM, a bilingual English and Arabic AI assistant."}
  • Arabic: {"role": "system", "content": "أنت علام، مساعد ذكاء اصطناعي مطور من الهيئة السعودية للبيانات والذكاء الاصطناعي..."}
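
In the Python example, this just means prepending the system message to the messages list before applying the chat template:

# Optional system prompt prepended to the conversation (English variant shown)
messages = [
    {"role": "system", "content": "You are ALLaM, a bilingual English and Arabic AI assistant."},
    {"role": "user", "content": "كيف أجهز كوب شاهي؟"},  # "How do I prepare a cup of tea?"
]
formatted_input_string = tokenizer.apply_chat_template(messages, tokenize=False)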

Ethical Considerations

Remember that LLMs like ALLaM can sometimes produce incorrect or biased outputs. It's crucial to implement safety measures and evaluate the model's suitability for your specific application. The generated output does not represent official statements from NCAI or SDAIA.


Reference: ALLaM-AI/ALLaM-7B-Instruct-preview Model Card on Hugging Face.

