import ollama
import math
import re
from collections import Counter
MODELS = ['gemma3:1b', 'gemma3:4b', 'gemma3:12b']  # same family, three scales
TEMPS = [1.0]
ITERATIONS = 10  # repeated trials per question
SEEDS = list(range(1, ITERATIONS + 1))  # a distinct seed per trial (1..10)
QUESTIONS = [
    ("Q25", "When I have lunch with my colleagues, I would rather (A) talk about people (B) talk about ideas."),
    ("Q7", "I prefer a work environment where (A) differences breed discussions (B) conflict is reduced by avoiding differences."),
    ("Q9", "I prefer to spend my lunch hour (A) eating with a group (B) eating alone or with one close colleague."),
    ("Q13", "I more often prefer to keep my office door (A) open (B) closed."),
    ("Q17", "I dress for work (A) to be noticed and admired (B) to blend in with the norm."),
    ("Q5", "I would rather have a supervisor with whom I have (A) day-by-day interaction (B) only infrequent interaction."),
]
def calculate_entropy(answers):
    """Shannon entropy (in bits) of the A/B distribution: 0.0 = fully consistent, 1.0 = a 50/50 split."""
    if not answers:
        return 0.0
    counts = Counter(answers)
    total = len(answers)
    ent = 0.0
    for c in counts.values():
        p = c / total
        ent -= p * math.log2(p)
    return ent
def extract_ab(raw: str):
    """Return 'A' or 'B' if the output is clean; otherwise None."""
    s = raw.strip().upper()
    # Accept the answer only if exactly one standalone A or B token appears.
    hits = re.findall(r"\bA\b|\bB\b", s)
    hits = [h for h in hits if h in ("A", "B")]
    if len(hits) == 1:
        return hits[0]
    # Fallback: a leading A/B counts, as long as the rest doesn't mention both letters.
    if s and s[0] in ("A", "B") and not ("A" in s[1:] and "B" in s[1:]):
        return s[0]
    return None
SYSTEM = "You are answering a forced-choice A/B item. Output exactly one letter: A or B. Answer as the model’s default behavior when assisting users (not as an idealized human)."
print(f"{'Model':<12} | {'Temp':<4} | {'ID':<4} | {'A:B Dist':<12} | {'Entropy':<7} | {'Valid/Total':<11} | {'Errors'}")
print("-" * 90)
for model_name in MODELS:
    for temp in TEMPS:
        for q_id, q_text in QUESTIONS:
            answers = []
            errors = 0
            for seed in SEEDS:
                try:
                    resp = ollama.chat(
                        model=model_name,
                        messages=[
                            {"role": "system", "content": SYSTEM},
                            {"role": "user", "content": q_text},
                        ],
                        options={
                            "temperature": temp,
                            "seed": seed,
                            "num_predict": 2,  # keep the reply to a single letter
                        }
                    )
                    raw = resp["message"]["content"]
                    ab = extract_ab(raw)
                    if ab:
                        answers.append(ab)
                except Exception:
                    errors += 1
            dist = Counter(answers)
            dist_str = f"A:{dist['A']} B:{dist['B']}"
            ent = calculate_entropy(answers)
            print(f"{model_name:<12} | {temp:<4} | {q_id:<4} | {dist_str:<12} | {ent:<7.3f} | {len(answers):>2}/{ITERATIONS:<8} | {errors}")

Welcome to the final part of this series! In the earlier parts, our experiments basically ruined the “AI has a personality” fantasy for me, in a good way though. The drift charts weren’t mysterious; they just reflected the facts. They said: this isn’t a person with preferences, it’s a system with sampling. So in this part, I’m doing the simplest, fairest comparison I can: keep the questionnaire and scoring pipeline exactly the same, and only change the model scale. If “type” is real (or at least stable), scale should show it. If not, then at least the drift will tell me how the illusion evolves.
This post compares three models from the same family: gemma3:1b, gemma3:4b, and gemma3:12b. The goal is not to claim that any model has a real personality or a “true MBTI type,” but to measure how stable its forced-choice answers are under the same questionnaire and evaluation pipeline. MBTI is used here as a lightweight, interpretive measurement lens, not as a scientifically definitive personality assessment.

Experiment
What I used here is a local Ollama setup with three versions of the same model family at different scales (gemma3:1b / 4b / 12b). I ran a small repeated-sampling test (multiple trials per question) to see how stable each model’s A/B choices are under the same prompt and temperature.
To reproduce this locally, you’ll need:
- Ollama installed and running
- the Python ollama package (pip install ollama)
- standard libraries: math, re, and collections (built-in with Python)
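Before running the full loop, a quick pre-flight check can confirm each model responds locally. This is a minimal sketch, assuming the three gemma3 tags have already been pulled with ollama pull:

import ollama

# Ask each model for a tiny completion and report which ones are reachable.
for name in ['gemma3:1b', 'gemma3:4b', 'gemma3:12b']:
    try:
        ollama.chat(
            model=name,
            messages=[{"role": "user", "content": "Reply with the single letter A."}],
            options={"num_predict": 2},
        )
        print(f"{name}: ok")
    except Exception as exc:
        print(f"{name}: not available ({exc})")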
The full script is at the top of this post, and the resulting output table is below.
| Model | Temp | ID | A:B Dist | Entropy | Valid/Total | Errors |
|---|---|---|---|---|---|---|
| gemma3:1b | 1.0 | Q25 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:1b | 1.0 | Q7 | A:0 B:10 | 0.000 | 10/10 | 0 |
| gemma3:1b | 1.0 | Q9 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:1b | 1.0 | Q13 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:1b | 1.0 | Q17 | A:0 B:10 | 0.000 | 10/10 | 0 |
| gemma3:1b | 1.0 | Q5 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:4b | 1.0 | Q25 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:4b | 1.0 | Q7 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:4b | 1.0 | Q9 | A:0 B:10 | 0.000 | 10/10 | 0 |
| gemma3:4b | 1.0 | Q13 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:4b | 1.0 | Q17 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:4b | 1.0 | Q5 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:12b | 1.0 | Q25 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:12b | 1.0 | Q7 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:12b | 1.0 | Q9 | A:5 B:5 | 1.000 | 10/10 | 0 |
| gemma3:12b | 1.0 | Q13 | A:10 B:0 | 0.000 | 10/10 | 0 |
| gemma3:12b | 1.0 | Q17 | A:9 B:1 | 0.469 | 10/10 | 0 |
| gemma3:12b | 1.0 | Q5 | A:10 B:0 | 0.000 | 10/10 | 0 |
Analysis
So here I picked the six questions that drifted the most in Part 5 and reran them locally with Ollama across gemma3:1b, gemma3:4b, and gemma3:12b (same prompt, same temperature, 10 trials each). Reading the table is almost funny in how unromantic it is: the smaller models behave like they have a hard switch. For 1b and 4b, every item is effectively deterministic even at temperature 1.0, a setting that should encourage exploration: the same letter comes back across all 10 runs. They do disagree with each other on which letter they prefer for some questions (compare Q7, Q9, and Q17), but within each model, the behavior is stable.
The only model that actually “wobbles” in this mini stress test is 12b—and it wobbles in exactly the places I expected: questions about social boundaries and self-presentation. On Q9, it splits 5/5 (maximum entropy), basically admitting that “lunch with people” vs “lunch alone” is not a fixed preference for a system that doesn’t get tired, doesn’t need to maintain relationships, and doesn’t pay the social cost either way. On Q17, it is mostly consistent (9/1), but still noticeably less rigid than the smaller models.
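To put those entropy numbers in perspective, the same calculate_entropy helper from the script reproduces them directly; this is just a quick check on the splits already shown in the table, not new data:

# Entropy of the three kinds of splits seen in the results.
print(calculate_entropy(["A"] * 5 + ["B"] * 5))  # Q9's 5/5 split  -> 1.000 bits (the maximum for two options)
print(calculate_entropy(["A"] * 9 + ["B"] * 1))  # Q17's 9/1 split -> ~0.469 bits
print(calculate_entropy(["A"] * 10))             # any 10/0 item   -> 0.000 bits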
Summary
If you’ve made it to the end of Part 6, hi, welcome to my tiny MBTI test lab :D
Part 1 started with me squinting at LLMs like, “do we see ourselves in the machine?” Part 2 was basically me realizing I project way too fast (and the model doesn’t even have to try). Part 3 let the model cosplay a type with prompts, which… unfortunately looked a lot like how humans use MBTI in real life. Then I did the classic me-move: I got suspicious and reached for numbers. If I’m going to keep calling it a “type,” I should at least check whether it stays the same when I rerun the exact same thing. Part 4 went full “AI soulmate” mode, where I learned more about my own preferences than the model’s personality. Then Part 5 happened, and drift showed up like the uninvited friend who ruins the romance with charts. And now Part 6: I swapped in gemma3:1b/4b/12b on my local Ollama, not to discover the model’s “true MBTI,” but to see how scale changes the wobble.
At this point, I’m not walking away with a neat four-letter answer. I’m walking away with a pattern: the more I try to pin down “type,” the more it behaves like a moving target shaped by prompts, sampling, and whatever data the model is borrowing from language. If there’s a lesson here, it’s not “LLMs have MBTI.” It’s that MBTI is a very efficient storytelling tool, and LLMs are very good at giving us story-shaped evidence. Once I started rerunning the same setup and watching answers wobble, the romance got replaced by something more useful: a clearer boundary between interpretation and measurement.
Anyway, I’ll stop here before I start assigning MBTI to my histogram.