6/28/2026

hama v1.6.0: alignments for ASR and P2G

Version 1.6.0 adds alignment outputs to two more modalities. G2P has always returned character-level alignments; now phoneme ASR can place each phoneme on a rough timeline, and P2G can point each output character back at the phoneme it came from. Existing outputs are unchanged, and the Python and TypeScript runtimes still produce byte-for-byte identical results.

  • ASR time spans: ASRModel.phoneme_spans(result) in Python and model.phonemeSpans(result) in TypeScript (plus the standalone ctc_phoneme_spans / ctcPhonemeSpans) return an approximate start/end time and frame range for every emitted phoneme, derived from the CTC frame alignment. CTC output is peaky (each label tends to fire on only a few frames), so these are coarse acoustic spans rather than exact boundaries: good enough for highlighting, scrubbing, or rough segmentation. The feature is pure post-processing and requires no model change.
  • P2G alignments: P2GResult.alignments maps each generated grapheme token back to the source phoneme it most attends to (the decoder's last-layer attention, captured during the cached greedy decode with no extra forward pass). The alignment is validated against PyTorch's own attention weights on a committed fixture, and the Python (native) and TypeScript (WASM) runtimes produce bit-identical results.
  • Performance, measured on a laptop CPU: G2P takes roughly 10 ms per word, phoneme ASR runs about 24x faster than real time, and P2G decodes around 1,500 tokens per second. Everything runs on-device, with no servers and no onnxruntime.
  • Public APIs are unchanged; existing G2P / ASR / P2G code keeps working. Python hama and TypeScript hama-js are aligned on version 1.6.0.
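To make the ASR time spans concrete, here is a minimal sketch of how spans can be read off a CTC frame alignment. It is not hama's implementation: the function name sketch_ctc_spans, the PhonemeSpan record, the blank ID of 0, and the 20 ms frame duration are all assumptions chosen for illustration. The idea is the standard CTC collapse: group consecutive identical frame labels into runs, drop blank runs, and report each remaining run as one phoneme with its frame range and time range.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PhonemeSpan:
    """Hypothetical record; hama's real span type may differ."""
    token_id: int
    start_frame: int   # inclusive
    end_frame: int     # exclusive
    start_time: float  # seconds
    end_time: float    # seconds


def sketch_ctc_spans(
    frame_ids: Sequence[int],
    blank_id: int = 0,        # assumed blank label ID
    frame_sec: float = 0.02,  # assumed frame duration (20 ms)
) -> List[PhonemeSpan]:
    """Turn per-frame argmax labels into one span per emitted phoneme."""
    spans: List[PhonemeSpan] = []
    run_start = 0
    # Walk one position past the end so the final run is flushed too.
    for i in range(1, len(frame_ids) + 1):
        run_ended = i == len(frame_ids) or frame_ids[i] != frame_ids[run_start]
        if not run_ended:
            continue
        label = frame_ids[run_start]
        if label != blank_id:  # blank runs emit nothing
            spans.append(
                PhonemeSpan(
                    token_id=label,
                    start_frame=run_start,
                    end_frame=i,
                    start_time=run_start * frame_sec,
                    end_time=i * frame_sec,
                )
            )
        run_start = i
    return spans
```

Because a peaky model may label only one or two frames per phoneme, the spans from a sketch like this tend to be shorter than the audible phoneme, which is why they are best treated as coarse anchors rather than boundaries.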
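The P2G alignment can be pictured the same way. The sketch below assumes the decoder's last-layer attention has already been reduced to a single weight per (output token, source phoneme) pair; how hama combines attention heads is not stated here, so that reduction, the function name sketch_attention_alignment, and the example values are all hypothetical. Each output token is simply mapped to the source position with the largest weight.

```python
from typing import List, Sequence


def sketch_attention_alignment(attn: Sequence[Sequence[float]]) -> List[int]:
    """Map each output token to the source index it attends to most.

    attn[t][s] is the (assumed, already head-reduced) attention weight that
    output token t places on source phoneme s. Rows must be non-empty.
    Ties resolve to the lowest source index.
    """
    alignment: List[int] = []
    for row in attn:
        best_source = max(range(len(row)), key=row.__getitem__)
        alignment.append(best_source)
    return alignment


# Illustrative values only: three source phonemes, three output graphemes.
phonemes = ["k", "æ", "t"]
graphemes = ["c", "a", "t"]
attn = [
    [0.8, 0.1, 0.1],  # "c" attends mostly to "k"
    [0.2, 0.7, 0.1],  # "a" attends mostly to "æ"
    [0.1, 0.2, 0.7],  # "t" attends mostly to "t"
]
pairs = [
    (g, phonemes[s])
    for g, s in zip(graphemes, sketch_attention_alignment(attn))
]
```

Since the weights are captured during the decode that already runs, a lookup like this adds only an argmax per output token on top of the existing work.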