
I was looking for a way to generate strong, easy-to-type passwords, and ended up with a simple method for generating a large word list. At the end of the post I’ll provide the code, which makes it easy to bring in your own input text corpus and generate a word list according to your needs.

Passphrase

As far as I can tell, this relevant xkcd popularized the use of a passphrase as a means of generating passwords. Unfortunately, 44 bits of entropy is now considered insufficient. From a cursory search, Proton recommends >100 bits of entropy for a “strong” password. Using the xkcd method (about 11 bits per word), we’d need at least 10 English words, which is a lot of letters to type!
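For the curious, that count is just the comic’s numbers plugged into a quick calculation (the 2048-word list size is the one usually attributed to the comic, not something Proton specifies):

import math

bits_per_word = math.log2(2048)          # xkcd-style list: 11 bits per word
print(math.ceil(100 / bits_per_word))    # 10 words needed to clear 100 bits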

One might ask: why do I need easy-to-type passwords if I’m already using a password manager with random password generation and autofill? That’s great, but there will still be some passwords that have to be typed by hand on a regular basis, e.g. the computer login or the master password of the password manager.

Is it possible to have a high entropy password that isn’t crazy long?

Pronounceable Gibberish

I don’t know whom to credit, but the idea behind pronounceable gibberish is to increase the entropy per “word” by not actually using words, but strings of letters that merely look somewhat like words. So, like, “batim”, “phrux” and whatnot. Still a lot easier to memorize and type than random letters like “bFxls”, but with much higher entropy per letter than real English words.

Ideally, I’d like a word list containing at least 2^20 words (around a million), each no longer than 5 letters, so I can plug it into KeePassXC and generate 80-bit passphrases for normal passwords and a 100-bit one for the master password. I looked around online but couldn’t really find anything like this.
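The arithmetic behind those numbers (nothing fancy, just the targets above):

import math

bits_per_word = math.log2(2 ** 20)       # 2^20 words -> 20 bits per word
print(4 * bits_per_word)                 # 4 words = 80 bits, ~20 letters to type
print(5 * bits_per_word)                 # 5 words = 100 bits, ~25 letters to type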

Crafting a generator

I spent two days trying to craft a gibberish generator with GPT-OSS-120B, using techniques like alternating consonants and vowels and stretching the definition of each as far as I could. In the end I mostly failed - I couldn’t get past 500k words without it starting to generate unpronounceable words.

Then I came up with the following algorithm, which was simple and good enough that I got what I wanted in 20 minutes. The idea is to generate all possible strings of up to 5 letters, run them through a scoring function, and keep the words with the highest scores. The scoring is a bit more involved, but GPT-OSS mostly took care of it: take some large text corpus as input (e.g. a large list of uncommon English words), estimate the log-likelihood of trigrams (i.e. how likely a given three-letter combination is to appear in a word), then compute the log-likelihood of each candidate accordingly. I’m honestly not sure how it works in full - I didn’t bother to read the code. Since it’s just for my own use, I didn’t do anything like filtering out bad words or common passwords.
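To make the scoring concrete, here is roughly what happens to a single candidate (“batim” is just the earlier example; the actual probabilities come from whatever corpus you feed in):

# candidate "batim", padded with start/end markers, becomes "<batim>"
# its trigrams: (<,b,a) (b,a,t) (a,t,i) (t,i,m) (i,m,>)
# score("batim") = log P(a | <,b) + log P(t | b,a) + log P(i | a,t)
#                + log P(m | t,i) + log P(> | i,m)
# candidates whose letter patterns resemble the corpus score least negatively,
# so the top 2^20 of them tend to look pronounceable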

Here is the code if you’d like to run it yourself:

"""
generate_pseudowords.py

Usage
-----
    python generate_pseudowords.py <corpus.txt> [output.txt]
"""

import sys
import math
import itertools
import heapq
from collections import defaultdict, Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
START = "<"
END   = ">"
N_TOP = 2 ** 20
SMOOTHING_ALPHA = 0.001
MAX_LEN = 5

def clean_word(word: str) -> str:
    """Return a lower‑cased word containing only alphabetic chars."""
    return ''.join(ch for ch in word.lower() if ch in ALPHABET)

def build_trigram_counts(corpus_path: str):
    """
    Scan the corpus and count every trigram, including start/end markers.
    Returns:
        trigram_counts: dict{(a,b,c): int}
        bigram_counts : dict{(a,b): int}   (needed for conditional probs)
    """
    trigram_counts = Counter()
    bigram_counts  = Counter()

    with open(corpus_path, encoding='utf-8') as f:
        for line in f:
            # split on whitespace – Wikipedia already tokenises fairly well
            for raw in line.split():
                w = clean_word(raw)
                if not w:                # skip empty tokens
                    continue
                # pad with start/end markers
                padded = START + w + END
                # slide a window of three characters
                for i in range(len(padded) - 2):
                    a, b, c = padded[i], padded[i+1], padded[i+2]
                    trigram_counts[(a,b,c)] += 1
                    bigram_counts[(a,b)]     += 1   # count of the context

    return trigram_counts, bigram_counts

def trigram_logprob(trigram, trigram_counts, bigram_counts, vocab_size, alpha):
    """Return log P(c | a,b) with add‑α smoothing."""
    a,b,c = trigram
    num = trigram_counts.get((a,b,c), 0) + alpha
    den = bigram_counts.get((a,b), 0) + alpha * vocab_size
    # Smoothing keeps both numerator and denominator positive, so log() is safe
    return math.log(num / den)

def score_word(word, trigram_counts, bigram_counts, vocab_size, alpha):
    """
    Compute the log‑probability score of a word (without start/end markers).
    The word must already be cleaned (lower‑case, a‑z only).
    """
    padded = START + word + END
    score = 0.0
    for i in range(len(padded) - 2):
        tri = (padded[i], padded[i+1], padded[i+2])
        score += trigram_logprob(tri, trigram_counts, bigram_counts,
                                 vocab_size, alpha)
    return score

def generate_all_candidates():
    """Yield every possible string of length 1‑5 over ALPHABET."""
    for length in range(1, MAX_LEN+1):
        for tup in itertools.product(ALPHABET, repeat=length):
            yield ''.join(tup)

def main(corpus_path: str, out_path: str = "pseudowords.txt"):
    print("[*] Building trigram counts from corpus …")
    trigram_counts, bigram_counts = build_trigram_counts(corpus_path)

    # Vocabulary for the trigram model = all symbols that can appear:
    vocab = set(ALPHABET) | {START, END}
    V = len(vocab)
    print(f"    #unique trigrams : {len(trigram_counts)}")
    print(f"    vocab size      : {V}")

    heap = []

    total = 0
    print("[*] Enumerating candidates and keeping the best ones …")
    for cand in generate_all_candidates():
        total += 1
        sc = score_word(cand, trigram_counts, bigram_counts, V, SMOOTHING_ALPHA)

        if len(heap) < N_TOP:
            heapq.heappush(heap, (sc, cand))
        else:
            # heap[0] is the worst (smallest) score in the current top set
            if sc > heap[0][0]:
                heapq.heapreplace(heap, (sc, cand))

        # occasional progress report
        if total % 1_000_000 == 0:
            print(f"    processed {total:,} candidates … "
                  f"heap size = {len(heap)} (worst score = {heap[0][0]:.2f})")

    print(f"[+] Finished scoring {total:,} candidates.")
    print(f"[+] Keeping the top {N_TOP:,} words.")

    print(f"[*] Writing results to '{out_path}' …")
    top_words = heapq.nlargest(N_TOP, heap)  # each entry = (score, word)

    with open(out_path, "w", encoding="utf-8") as out:
        for score, word in top_words:
            out.write(f"{word}\n")

    print("[+] Done!")

if __name__ == "__main__":
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        print(__doc__)
        sys.exit(1)

    corpus_file = sys.argv[1]
    output_file = sys.argv[2] if len(sys.argv) == 3 else "pseudowords.txt"
    main(corpus_file, output_file)
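
To run it, point the script at any large text file; the corpus path below is just a common example on Linux systems, and any big list of English words should work:

python generate_pseudowords.py /usr/share/dict/words pseudowords.txt

The output file has one candidate per line, best-scoring first, which you can then load into KeePassXC as a custom wordlist for its passphrase generator.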