How to Run Your Own LLM Locally in Python — Using Bielik-7B and Hugging Face Transformers

Published: 2025-11-10

In recent months, we’ve seen a boom in local large language models (LLMs), such as Bielik-7B by the SpeakLeash team. The best part? You no longer need a cloud API to run a powerful AI model on your laptop.

Why Run Models Locally

  • Privacy: Your data stays on your machine.
  • Faster iteration: No API limits or throttling.
  • Cost ≈ 0: No API fees.
  • Full control: Modify, train, or inspect how your model works.

This guide will walk you through running the Polish Bielik-7B model locally on Apple Silicon (M1/M2/M3) using Python and the Transformers library — in just a few lines of code. You’ll even build a simple Polish chatbot.

Choosing the Model — Bielik-7B

Bielik-7B is an open-source, Polish-language large language model created by the SpeakLeash team. Available on Hugging Face: 👉 https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1

Variants:

  • Bielik-7B-Base – Raw, pre-trained model.

  • Bielik-7B-Instruct – Fine-tuned for conversation (ChatGPT-style).

For interactive use cases or chatbots, always pick the Instruct version.

Environment Setup

Let’s start with the basics — install Python 3.10+ and a few libraries:

pip install torch transformers accelerate

💡 If you’re on Apple Silicon (M1/M2/M3), make sure you have torch ≥ 2.1 with MPS (Metal Performance Shaders) support — this allows GPU acceleration on macOS.

Check GPU availability:

import torch
print(torch.backends.mps.is_available())

If the output is True, you’re good to go!

Running the Model — Step by Step

Copy the following code into bielik_demo.py and run it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Model ID from Hugging Face
model_id = "speakleash/Bielik-7B-Instruct-v0.1"

# 2. Device — MPS (Apple GPU) or CPU
DEVICE = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# 3. Load tokenizer
print(f"Loading tokenizer for {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4. Load model
print(f"Loading model on {DEVICE}...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # memory efficient
    device_map=DEVICE.type
)
model.eval()

# 5. Prepare the prompt, wrapped in the [INST] ... [/INST] instruction markers
prompt_text = "Write a short poem about an awakening artificial intelligence."
formatted_prompt = f"[INST] {prompt_text} [/INST]"
input_ids = tokenizer.encode(formatted_prompt, return_tensors="pt").to(DEVICE)

# 6. Generate response
print("\nGenerating response...")
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=256,   # total length in tokens, prompt included
        do_sample=True,   # sample instead of always taking the top token
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id
    )

# 7. Decode result and keep only the text after the closing marker
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
final_response = generated_text.split("[/INST]")[-1].strip()
print("\n--- Bielik’s Response ---\n")
print(final_response)
print("\n--------------------------\n")

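⚠️ Tip: If you encounter RuntimeError: MPS backend out of memory, try reducing max_length or closing other memory-hungry applications. The script already loads the weights in float16, which halves their memory footprint compared with float32.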
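To see why memory is the usual bottleneck, here is a back-of-the-envelope sketch of how much space the weights alone take. It assumes a round 7 billion parameters (read off the "7B" in the model name, so the real count may differ slightly) and ignores activations and the generation cache, which come on top. The helper name weight_memory_gb is made up for this illustration.

```python
# Rough estimate of weight memory only; activations and the cache are extra.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

# Assumption: 7e9 parameters, taken from the "7B" in the model name.
N_PARAMS = 7e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.0f} GB")
```

Under these assumptions, float16 needs roughly half the memory of float32, which is why the script requests it when loading the model.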

Under the Hood

The Transformers library from Hugging Face simplifies complex model operations:

[Prompt] → [Tokenizer] → [Model (GPU)] → [Sampling] → [Decoder] → [Text]

  • AutoTokenizer splits text into tokens.
  • AutoModelForCausalLM handles decoder-only models (GPT, Llama, Bielik).
  • generate() performs token sampling based on probability distributions.
  • temperature, top_k, and top_p control creativity and randomness.
  • MPS (Metal Performance Shaders) enables GPU computation on macOS.
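
To make the roles of temperature, top_k, and top_p more concrete, here is a toy sketch in pure Python that picks one token from a list of raw scores (logits). It is an illustration of the idea, not the actual Transformers implementation, and the function name sample_next_token is invented for this example.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Pick one token id from raw logits using temperature, top-k and top-p."""
    # Temperature: dividing by T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalised mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Draw one survivor; random.choices normalises the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Example: three candidate tokens, seeded generator for repeatable runs.
print(sample_next_token([2.0, 1.0, 0.1], rng=random.Random(0)))
```

Lower temperature and smaller top_k or top_p shrink the set of candidates, so the output becomes more predictable; larger values let less likely tokens through.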

Local vs API Comparison

| Aspect         | Local Model (Bielik-7B) | Cloud API (e.g. GPT-4) |
|----------------|-------------------------|------------------------|
| Privacy        | 🔒 Full data control    | 🔄 Data sent to cloud  |
| Cost           | 💸 Free                 | 💰 Pay per token       |
| Performance    | ⚡ Hardware-dependent   | ⚙️ Cloud-scalable      |
| Quality        | 🧠 Good (esp. Polish)   | 🧠🧠🧠 Very high       |
| Offline Access | ✅ Yes                  | ❌ No                  |

On a MacBook M2 Pro, Bielik-7B can generate around 5–10 tokens/s, which is more than enough for prototyping.
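
As a quick sanity check on what that rate means in practice, the sketch below converts a token rate into waiting time and shows one way to measure the rate on your own machine. Both helper names are hypothetical, and measure_rate expects you to pass in a function that runs one generation and returns how many new tokens it produced.

```python
import time

def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time needed to produce n_tokens at a steady rate."""
    return n_tokens / tokens_per_second

def measure_rate(generate_fn) -> float:
    """Time generate_fn(), which must return the number of new tokens produced."""
    start = time.perf_counter()
    n_new = generate_fn()
    elapsed = time.perf_counter() - start
    return n_new / elapsed

# The 5-10 tokens/s range quoted above, applied to a 256-token answer.
for rate in (5, 10):
    print(f"{rate} tokens/s -> {generation_seconds(256, rate):.1f} s for 256 tokens")
```

At the quoted rates, a 256-token answer takes somewhere between about half a minute and a minute, which is acceptable for experiments but noticeable in an interactive chat.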

Your First Local Chatbot

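Add a simple interactive loop to chat with Bielik directly from your terminal. Save the loading code from the previous section (steps 1–4) together with the loop below as bielik_chat.py:

while True:
    prompt = input("You: ")
    if prompt.strip().lower() in ["exit", "quit"]:
        break

    # The tokenizer returns a dict-like batch (input_ids and attention_mask)
    inputs = tokenizer(f"[INST] {prompt} [/INST]", return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_length=200,
            do_sample=True,  # needed for temperature and top_p to take effect
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(output[0], skip_special_tokens=True)

    print("Bielik:", response.split("[/INST]")[-1].strip())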

Run it:

python bielik_chat.py

And that’s it — you now have your own offline Polish chatbot running entirely on your laptop.
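
The loop above treats every message as a fresh conversation, so the model does not remember earlier turns. One possible way to add short-term memory is to rebuild the prompt from the whole history on each turn. The sketch below is only an assumption about how that could look: build_prompt is a hypothetical helper, and the default [INST] and [/INST] markers follow the Llama-2/Mistral-style convention, so check the model card's chat template before relying on the exact layout.

```python
def build_prompt(history, user_message, open_tag="[INST]", close_tag="[/INST]"):
    """Join earlier (user, assistant) turns and the new message into one prompt.

    history: list of (user_text, assistant_text) pairs from previous turns.
    The tag defaults are an assumption; pass other markers if your model needs them.
    """
    parts = []
    for user_text, assistant_text in history:
        parts.append(f"{open_tag} {user_text} {close_tag} {assistant_text}")
    # The newest message ends with the closing tag so the model answers next.
    parts.append(f"{open_tag} {user_message} {close_tag}")
    return " ".join(parts)

# Example with one earlier exchange.
history = [("What is the capital of Poland?", "Warsaw.")]
print(build_prompt(history, "And how many people live there?"))
```

In the chat loop you would pass the result to the tokenizer instead of the single-message string, then append each (prompt, answer) pair to history after the model replies. Keep in mind that a longer history uses up more of the max_length budget.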

What’s Next?

Now that your local LLM is running, here are some ideas for next steps:

  • Fine-tuning / LoRA: Train it on your own data.
  • RAG (Retrieval-Augmented Generation): Connect Bielik to your knowledge base.
  • AI Agent in Python: Integrate with APIs and memory systems.

Summary

Running your own LLM locally in Python takes just a few lines of code. With Hugging Face Transformers and GPU support via MPS, you can:

  • Experiment with AI without relying on the cloud.
  • Keep your data private.
  • Build your own NLP projects — in Polish and for free.