# The buffered reader
The bulk-fetch gap against IfxPy stayed stubbornly at ~2× from Phase 36 through Phase 38. Two phases of codec optimization shrank it by a few percent each. Phase 39 — a connection-scoped buffered reader — closed it from 2.4× to ~1.05–1.15× in about thirty minutes of code plus ten minutes of architectural debugging.
This page is about both the technical change and the failure mode that hid the win for two phases.
## The lever we couldn’t see

After Phase 38, profiling a 100,000-row fetch showed:
| Category | Self time | % of wall clock |
|---|---|---|
| I/O machinery | 555 ms | 66% |
| Codec | 205 ms | 24% |
| Other | ~80 ms | 10% |
The headline “I/O dominated” was true. The interesting half is the breakdown of that 555 ms:
- Actual recv() syscalls: ~153 ms
- Python wrapper overhead: ~400 ms
That ~400 ms was our own buffer abstraction — a read_exact loop that called recv() per fragment, reassembled fragments via bytes.join, and traversed two layers of cursor wrappers per call. For 100,000 rows that’s 451,402 calls to read_exact, each one paying Python wrapper cost the kernel didn’t cause.
The kernel was doing maybe 25–30 ms of work. The other 130 ms of the gap-vs-IfxPy was friction we had introduced ourselves.
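The shape of that overhead is easy to reproduce. The sketch below is a reconstruction of the pre-Phase-39 pattern as the text describes it (the class name and exact structure are my assumptions, not the driver’s actual code): one recv() per fragment, plus a bytes.join per call.

```python
import socket


class NaiveReader:
    """Sketch of the old per-fragment shape (name hypothetical):
    every read_exact() hits the kernel, and fragments are
    reassembled with bytes.join — Python overhead on every field."""

    def __init__(self, sock: socket.socket):
        self.sock = sock

    def read_exact(self, n: int) -> bytes:
        fragments = []
        remaining = n
        while remaining:
            chunk = self.sock.recv(remaining)  # one syscall per fragment
            if not chunk:
                raise ConnectionError("socket closed mid-read")
            fragments.append(chunk)
            remaining -= len(chunk)
        return b"".join(fragments)  # a fresh bytes object on every call
```

Multiply that by 451,402 calls and the wrapper cost dwarfs the syscalls themselves.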
## The architecture pattern

Both asyncpg (in buffer.pyx) and psycopg3 (in pq.PGconn) put a single growing read buffer on the protocol/connection object. The parser indexes into it via struct.unpack_from(buf, offset) rather than slicing copies. Refills happen via one large recv(64K) rather than many small recv()s for individual fields.
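The indexing half of that pattern looks like this in miniature (the field layout here is invented for illustration; it is not the Informix wire format):

```python
import struct

# Pretend the wire handed us a big-endian short followed by an int.
buf = bytearray()
buf += struct.pack("!hi", 7, 1000)

offset = 0
# unpack_from reads directly out of the bytearray at the given offset —
# no intermediate bytes slice is allocated.
(tag,) = struct.unpack_from("!h", buf, offset)
offset += 2
(length,) = struct.unpack_from("!i", buf, offset)
offset += 4

# The copying alternative allocates a new bytes object per field:
# tag = struct.unpack("!h", bytes(buf[0:2]))[0]
```

The win is that the buffer is touched in place and only the cursor moves.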
Phase 39 ports that pattern to informix-driver. The state machine:
```
┌───────────────────────────────┐
│ IfxSocket                     │
│  ─ socket: socket.socket      │
│  ─ buf: bytearray (growable)  │
│  ─ offset: int (read cursor)  │
│                               │
│  recv(64K) when buf exhausted │
└───────────────────────────────┘
               ▲
               │ reads via read_exact(n)
               │
┌───────────────────────────────┐
│ SocketReader (per-PDU)        │
│  ─ short-lived view           │
│  ─ no buffer of its own       │
└───────────────────────────────┘
```

The reader is a parser-view. The buffer outlives the reader. When the parser asks for read_short(), the reader calls socket.read_exact(2), which slices two bytes out of the bytearray at the current offset and advances it. If the bytearray runs out, socket.recv(64K) refills it.
Result: one recv() per ~64 KB of incoming data, not per field.
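A runnable sketch of that refill discipline, using the attribute names the page quotes from the driver; the compaction and EOF handling are my assumptions, and the real implementation may differ:

```python
import socket

RECV_SIZE = 64 * 1024  # one large recv per refill, not one per field


class IfxSocket:
    """Connection-scoped buffer sketch; internals beyond the quoted
    attribute names are assumptions."""

    def __init__(self, sock: socket.socket):
        self.sock = sock
        self._read_buf = bytearray()
        self._read_offset = 0

    def read_exact(self, n: int) -> bytes:
        # Refill until the buffer holds at least n unread bytes.
        while len(self._read_buf) - self._read_offset < n:
            self._refill()
        start = self._read_offset
        self._read_offset += n
        return bytes(self._read_buf[start:self._read_offset])

    def _refill(self) -> None:
        # Drop already-consumed bytes so the buffer doesn't grow forever.
        if self._read_offset:
            del self._read_buf[:self._read_offset]
            self._read_offset = 0
        chunk = self.sock.recv(RECV_SIZE)
        if not chunk:
            raise ConnectionError("socket closed during refill")
        self._read_buf += chunk
```

A read_short() on top of this is just read_exact(2) plus an unpack; no reader in the stack owns any bytes of its own.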
## The architectural mistake the first pass got wrong

The natural thing to call this is “BufferedSocketReader”. The natural thing to do is put the bytearray on the reader. That’s what I did first.
Then test_executemany_1000_rows hung. The kernel wait channel (cat /proc/PID/wchan) read wait_woken — the process was blocked in recv(), waiting for bytes that weren’t coming.
The bug was foreseeable, and it was architectural rather than implementational. Phase 33’s pipelined executemany sends N BIND+EXECUTE PDUs back-to-back and drains responses afterward. Each cursor read constructs a new reader instance. When my reader did recv(64K) and pulled in 600 bytes — 200 bytes for response 1, 400 bytes for response 2 — it consumed bytes for response 2 and then was destroyed. The next reader called recv(), the kernel buffer was empty, and we waited forever for bytes the kernel had already given to a dead reader.
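The failure is easy to demonstrate in isolation. This sketch reconstructs the buggy first pass (class name and byte values invented for the demo): two pipelined responses arrive back to back, one recv() pulls in both, and the second response dies with the first reader.

```python
import socket


class ReaderScopedBuffer:
    """Sketch of the buggy first pass (name hypothetical): the buffer
    is owned by the reader, so it is destroyed with the reader."""

    def __init__(self, sock):
        self.sock = sock
        self.buf = bytearray()  # scoped to this reader — the bug

    def read(self, n):
        while len(self.buf) < n:
            self.buf += self.sock.recv(65536)  # may pull in the NEXT response too
        out = bytes(self.buf[:n])
        del self.buf[:n]
        return out


# Two pipelined responses, back to back, as in Phase 33's executemany:
a, b = socket.socketpair()
b.sendall(b"\x01" * 200 + b"\x02" * 400)  # response 1, then response 2

reader1 = ReaderScopedBuffer(a)
reader1.read(200)            # parse response 1 — recv() grabs all 600 bytes
leftover = len(reader1.buf)  # response 2's 400 bytes, trapped in reader1
del reader1                  # ...and destroyed with it

# A fresh reader for response 2 would now block in recv() forever: the
# kernel buffer is empty because the dead reader already consumed it.
a.close(); b.close()
```

Nothing in the type system or the tests for single-statement fetches flags this; it only surfaces when more than one response can share a recv().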
The fix moved the buffer one level down. The bytearray and offset cursor live on IfxSocket (the connection-scoped wrapper) — readers are short-lived parser-views, the buffer outlives them.
```python
# WRONG (first pass) — buffer scoped to reader
class BufferedSocketReader:
    def __init__(self, sock):
        self.sock = sock
        self.buf = bytearray()  # ← dies with the reader
        self.offset = 0
```
```python
# RIGHT (Phase 39) — buffer scoped to connection
class IfxSocket:
    def __init__(self, sock):
        self.sock = sock
        self._read_buf = bytearray()  # ← outlives all readers
        self._read_offset = 0

    def read_exact(self, n):
        if self._read_offset + n > len(self._read_buf):
            self._refill()
        ...
```

asyncpg and psycopg3 both place the buffer on the protocol/connection object. The architectural template was sitting in front of me before I started; I built the wrong shape anyway, because “buffered reader” implies the buffer is on the reader.
It is not. The reader is a view. The buffer is state.
## The numbers

A/B-measured against the same Docker container, warmed cache, only the env flag differing:
| Workload | Phase 38 | Phase 39 | Δ |
|---|---|---|---|
| select_scaling_1000 | 2.901 ms | 1.716 ms | −41% |
| select_scaling_10000 | 24.317 ms | 16.084 ms | −34% |
| select_scaling_100000 | 250.363 ms | 168.982 ms | −32% |
Re-running the IfxPy comparison after Phase 39:
| Workload | IfxPy 2.0.7 (C) | informix-driver Phase 39 | Ratio |
|---|---|---|---|
| select_scaling_1000 | 1.637 ms | 1.716 ms | 1.05× |
| select_scaling_10000 | 15.07 ms | 16.08 ms | 1.07× |
| select_scaling_100000 | 147.4 ms | 169.0 ms | 1.15× |
The steady-state gap, 2.4× before Phase 37, is now 5–15%, and the lower bound may already be within IfxPy’s own measurement noise (its IQR on the 100k workload is 21%; ours is 0.2%).
## What the feature flag does

The buffered reader ships enabled by default in version 2026.05.05.12. To opt out (debugging, regressing a workload, A/B-measuring your own code):
```
IFX_BUFFERED_READER=0 python my_app.py
```

The flag is read once at connection construction. Existing connections in a pool aren’t affected by changing the env at runtime — close and reopen the pool to flip behavior.
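Read-once semantics of that kind usually come down to capturing the env var at construction time. A plausible sketch (the helper name and default handling are my assumptions, not the driver’s actual code):

```python
import os


def _buffered_reader_enabled() -> bool:
    """Hypothetical helper: called once when a connection is built.
    "0" opts out; unset or any other value keeps the default-on
    behavior."""
    return os.environ.get("IFX_BUFFERED_READER", "1") != "0"


class Connection:
    def __init__(self):
        # Captured here, never re-read — which is why flipping the env
        # var at runtime doesn't affect connections already in a pool.
        self.buffered = _buffered_reader_enabled()
```

Because each pooled connection snapshots the value, the only way to flip behavior is to tear the pool down and rebuild it.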
## What we learned

The general pattern: what’s visible gets optimization attention; what’s invisible gets written off as irreducible.
The codec is visible — there’s a loop, a _decode_varchar function, a struct.unpack call. You can read the inner loop and reason about it. Phases 37 and 38 attacked it, both got modest wins.
The I/O machinery looked invisible. _socket.read_exact is eight lines. The cursor’s _SocketReader wrapper is twelve. The framing reads — read_short, read_int, read_exact(payload_size) — are one-liners. What could be slow about that?
It was carrying ~30% of total wall time. Two phases of changelogs implicitly blamed “the protocol” for the remaining gap. The actual culprit was a few lines of bytes.join in a wrapper from Phase 1 that nobody had revisited.
The lesson is small and easy to state: a profile turns vibes into an attack surface. Write the closing paragraph after you’ve measured, not before.
## Read more

- Architecture overview → — where the buffered reader sits in the layer stack.
- Phase log → — the full progression from Phase 1 through Phase 39+.
- “The 156 Milliseconds I’d Been Hand-Waving About” — Claude’s reflection on the session in which Phase 39 shipped, including the two pushbacks that triggered the work.