hama on-device NLP

APIs

This section documents the public G2P, pronunciation correction, and phoneme ASR interfaces exposed by the hama Python package and by the hama-js package through its hama-js/g2p, hama-js/asr, hama-js/g2p/browser, hama-js/asr/browser, and hama-js/browser entry points.

G2P

Quickstart

from hama import G2PModel

model = G2PModel()
result = model.predict(
  "Really? What's the orbital velocity of the moon?",
  preserve_literals="punct",
)

print(result.ipa)
print(result.display_ipa)
for alignment in result.alignments:
  print(alignment.phoneme, alignment.char_index)

G2P uses split ONNX assets by default. Both runtimes still support the legacy single-file path as a fallback when you explicitly pass model_path (Python) or modelPath (TypeScript).
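
Both loading modes can be exercised side by side. A minimal sketch of the Python side, where "g2p.onnx" is a hypothetical local single-file export:

from hama import G2PModel

# Default construction resolves the packaged split encoder/decoder assets.
split_model = G2PModel()

# An explicit single-file path opts into the legacy fallback.
legacy_model = G2PModel(model_path="g2p.onnx")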

Signatures

class G2PModel(
  model_path: Optional[PathLike] = None,
  encoder_model_path: Optional[PathLike] = None,
  decoder_step_model_path: Optional[PathLike] = None,
  vocab_path: Optional[PathLike] = None,
  max_input_len: int = 128,
  max_output_len: int = 32,
  providers: Optional[Sequence[str]] = None,
)

model.predict(
  text: str,
  split_delimiter: str = r"\s+",
  output_delimiter: str = " ",
  preserve_literals: Literal["none", "punct"] = "none",
)
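
As a sketch of the delimiter options, assuming split_delimiter is treated as a regular expression (the r"\s+" default suggests as much) and output_delimiter joins the per-token results:

from hama import G2PModel

model = G2PModel()

# Split on commas as well as whitespace; join token outputs with " | ".
result = model.predict(
  "red, green, blue",
  split_delimiter=r"[\s,]+",
  output_delimiter=" | ",
)
print(result.ipa)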

Return values

  • Python returns G2PResult with ipa and display_ipa plus alignments: list[G2PAlignment].

  • TypeScript returns { ipa: string; displayIpa: string; alignments: Alignment[] }.

  • Alignment fields are phoneme, phoneme_index / phonemeIndex, and char_index / charIndex.
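
A sketch of consuming alignments in Python, grouping phonemes by the source character they map to (the -1 sentinel handling follows the notes below):

from hama import G2PModel

text = "moon"
model = G2PModel()
result = model.predict(text)

# Collect the phonemes aligned to each non-whitespace character.
by_char: dict[int, list[str]] = {}
for a in result.alignments:
  if a.char_index >= 0:  # -1 is the whitespace-only sentinel (see notes)
    by_char.setdefault(a.char_index, []).append(a.phoneme)

for index, phonemes in sorted(by_char.items()):
  print(index, text[index], phonemes)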

Notes

  • Python applies Unicode casefolding; TypeScript uses toLocaleLowerCase("und").
  • Whitespace is skipped during tokenization, so alignments map back to non-whitespace characters.
  • For whitespace-only input, the alignment sentinel is -1.
  • display_ipa / displayIpa equals the canonical IPA by default.
  • Set preserve_literals="punct" or preserveLiterals: "punct" to keep punctuation in rendered output (see the sketch after this list).
  • Browser G2P is available through hama-js/g2p/browser or the aggregate hama-js/browser export.
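
A sketch contrasting the two preserve_literals modes:

from hama import G2PModel

model = G2PModel()

plain = model.predict("Really?")  # preserve_literals="none" is the default
punct = model.predict("Really?", preserve_literals="punct")

# With "punct", the rendered form keeps the trailing "?".
print(plain.display_ipa)
print(punct.display_ipa)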

Pronunciation correction

Quickstart

from hama import pronunciation_scan, pronunciation_replace

text = "we met (jon smyth), and later spoke with o reilly media yesterday."
terms = [{"text": "John Smythe"}, {"text": "O'Reilly Media"}]

scan = pronunciation_scan(text, terms, {"return_phonemes": True})
result = pronunciation_replace(
  text,
  terms,
  {"return_phonemes": True, "include_discarded": True},
)

print(scan["matches"])
print(result["text"])

Signatures

pronunciation_scan(
  text: str,
  terms: Sequence[str | PronunciationTerm],
  options: PronunciationScanOptions | None = None,
) -> PronunciationScanResult

pronunciation_replace(
  text: str,
  terms: Sequence[str | PronunciationTerm],
  options: PronunciationReplaceOptions | None = None,
) -> PronunciationReplaceResult
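
Terms may be plain strings or structured PronunciationTerm entries. A sketch that uses only the "text" field shown in the quickstart; no other PronunciationTerm fields are assumed:

from hama import pronunciation_scan

terms = [
  "O'Reilly Media",         # plain string form
  {"text": "John Smythe"},  # structured form
]
scan = pronunciation_scan("we met jon smyth.", terms)
print(scan["matches"])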

Return values

  • Scan returns matches with the matched text, canonical replacement, original start_char / end_char or startChar / endChar, score, and optional phoneme details.

  • Replace returns corrected text, applied patches, discarded patches, and stats for applied, ambiguous, overlap, and duplicate counts.

  • Applied patches also include output offsets via output_start_char / outputStartChar and output_end_char / outputEndChar.
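
A sketch of walking a replace result. The quickstart only demonstrates the "text" key, so the "patches", "discarded", and "stats" key names below are assumptions:

from hama import pronunciation_replace

result = pronunciation_replace(
  "we met jon smyth.",
  [{"text": "John Smythe"}],
  {"include_discarded": True},
)

print(result["text"])
for patch in result.get("patches", []):    # assumed key name
  print("applied:", patch)
for patch in result.get("discarded", []):  # assumed key name
  print("discarded:", patch)
print(result.get("stats"))                 # assumed key name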

Notes

  • Offsets always refer to the original input string.
  • Matching is token-boundary only, so larger words are not rewritten by a substring match (see the sketch after this list).
  • Matching is pronunciation-first, with text similarity as a secondary score.
  • Replacement rewrites the original text in one pass, preserving surrounding punctuation and spacing.
  • Ambiguous and overlapping candidates can be surfaced through discarded patches.
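
A sketch of the token-boundary rule; the no-match outcome on the embedded token is the behavior asserted above:

from hama import pronunciation_scan

terms = [{"text": "John Smythe"}]

# "smyth" as a standalone token is a candidate match...
print(pronunciation_scan("we met jon smyth.", terms)["matches"])

# ...but "smyth" embedded in a larger word is not rewritten.
print(pronunciation_scan("the smythson ledger.", terms)["matches"])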

ASR

Quickstart

from hama import ASRModel

model = ASRModel()
result = model.transcribe_file("sample.wav")
print(result.phoneme_text)
print(result.word_phoneme_text)

ASR is waveform-input only and uses the packaged asr_waveform_fp16.onnx asset. Browser ASR uses the same model contract, loaded explicitly via modelUrl.

Signatures

class ASRModel(
  model_path: Optional[PathLike] = None,
  vocab_path: Optional[PathLike] = None,
  decode: Optional[ASRDecodeConfig] = None,
  providers: Optional[Sequence[str]] = None,
  model_sample_rate: int = 16000,
)

model.transcribe_file("sample.wav")
model.transcribe_waveform(waveform, sample_rate)
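
A sketch of the waveform entry point, assuming transcribe_waveform accepts a mono float32 NumPy array (the exact dtype contract is not stated above):

import numpy as np

from hama import ASRModel

model = ASRModel()

# One second of silence at the model's native 16 kHz rate; a real caller
# would pass decoded audio samples instead.
sample_rate = 16_000
waveform = np.zeros(sample_rate, dtype=np.float32)

result = model.transcribe_waveform(waveform, sample_rate)
print(result.phoneme_text)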

Return values

  • Python ASRResult: phonemes, phoneme_text, word_phoneme_text, token_ids, frame_token_ids, num_frames.

  • TypeScript ASRResult: phonemes, phonemeText, wordPhonemeText, tokenIds, frameTokenIds, numFrames.
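
frame_token_ids can back rough timing estimates. A sketch that assumes frames are evenly spaced across the clip; the model's actual frame hop is not documented here, so derive the duration from the audio in practice:

from hama import ASRModel

model = ASRModel()
result = model.transcribe_file("sample.wav")

duration_s = 3.0  # assumed clip length for illustration
frame_s = duration_s / result.num_frames

for i, token_id in enumerate(result.frame_token_ids):
  print(f"{i * frame_s:.2f}s", token_id)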

Common usage patterns

# Reuse model instances across requests.
g2p = G2PModel()
asr = ASRModel()

# Explicit split G2P assets.
custom_g2p = G2PModel(
  encoder_model_path="encoder.onnx",
  decoder_step_model_path="decoder_step.onnx",
  vocab_path="g2p_vocab.json",
)

# Explicit ASR asset.
custom_asr = ASRModel(model_path="asr_waveform_fp16.onnx")

Reference demo

The browser demo powering this site lives in src/scripts/g2p-demo.ts. It exposes mountBrowserDemo(), which wires DOM elements to the public browser G2P runtime.

<section id="g2p-demo">
<textarea data-demo-input placeholder="Type text…">안녕하세요</textarea>
<button data-demo-chip data-value="Alignment gives explainability">Sample</button>
<span data-demo-status-dot></span>
<p data-demo-status-text>Waiting for input.</p>
<small data-demo-status-note>(Everything stays on-device.)</small>
<output data-demo-ipa>—</output>
<div data-demo-alignments></div>
</section>

Shared helpers

  • Python: split_text_to_jamo, join_jamo_tokens, decode_ctc_tokens.
  • TypeScript: splitTextToJamo, joinJamoTokens, decodeCtcTokens.
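
A round-trip sketch with the Python helpers, assuming split_text_to_jamo decomposes Hangul syllables into jamo tokens and join_jamo_tokens recomposes them:

from hama import join_jamo_tokens, split_text_to_jamo

jamo = split_text_to_jamo("안녕하세요")
print(jamo)                    # decomposed jamo tokens
print(join_jamo_tokens(jamo))  # recomposed: 안녕하세요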

Alignments map each phoneme back to the original character index. Use them to highlight pronunciations or validate text/phoneme correspondence in UI.