Python

Writing a Huffman Compressor: Prefix Codes, Bit Packing, and the Header Problem

Compression is one of those topics that sits in a peculiar spot: everyone uses it constantly, almost nobody writes one. The algorithms behind gzip, zip, zstd get treated as a black box you pipe bytes through. Huffman coding is the smallest interesting member of that family, and writing one by hand is the cleanest way to surface the parts of compression that aren’t really about compression at all: how you pack variable-length codes into fixed-size bytes, how the decoder finds the table, and what “the end of the file” even means when your last byte is half real data and half padding. ...

Writing a JSON Parser From Scratch: Lexer, Recursive Descent, and the Cases You Forget

I’ve been reading JSON for years without ever writing the code that parses it. That’s a strange gap: it’s a format we touch most days, and the one whose internals I’d thought about least. So I sat down one evening and wrote one from scratch in Python, no json.loads, no regex shortcuts, just the spec and a blank file. The structure is the textbook one: a lexer that turns characters into tokens, and a recursive-descent parser that turns tokens into a tree. What I want to write about is less the structure itself (that part is well-trodden) and more the places where the spec is sharper than it looks, and where a reasonable-seeming shortcut quietly produces a parser that accepts invalid input. ...

Reimplementing wc: Where Text Abstractions Leak

Reimplementing wc is less about counting and more about confronting the places where “text” stops being a clean abstraction. The interface is trivial; the semantics are not. Most of the interesting behavior lives at the boundary between bytes, encodings, and Unix I/O conventions. Below is a minimal clone. It’s deliberately scoped: correct along a few dimensions, incomplete along others. #!/usr/bin/env python3 import argparse, sys def get_bytes(raw): return len(raw) def get_lines(text): return text.count('\n') def get_words(text): return len(text.split()) def get_chars(text): return len(text) def main(): parser = argparse.ArgumentParser() parser.add_argument('file_path', nargs='?') parser.add_argument('-c', action="store_true") parser.add_argument('-l', action="store_true") parser.add_argument('-w', action="store_true") parser.add_argument('-m', action="store_true") args = parser.parse_args() if args.file_path: with open(args.file_path, "rb") as f: raw = f.read() label = args.file_path elif not sys.stdin.isatty(): raw = sys.stdin.buffer.read() label = None else: return text = raw.decode('utf-8') if not any([args.c, args.l, args.w, args.m]): args.c = args.l = args.w = True results = [] if args.l: results.append(str(get_lines(text))) if args.w: results.append(str(get_words(text))) if args.c: results.append(str(get_bytes(raw))) if args.m: results.append(str(get_chars(text))) parts = ' '.join(results) print(f"{parts} {label}" if label else parts) if __name__ == "__main__": main() Bytes vs characters is an interface boundary, not trivia The -c / -m distinction is where most implementations quietly diverge from spec. ...