A comprehensive guide and toolkit for fine-tuning the Moonshine ASR model for custom languages and domains.
This repository provides everything you need to fine-tune Moonshine, a lightweight and efficient automatic speech recognition (ASR) model that achieves performance comparable to much larger models with only 27M parameters.
What you'll learn:
- ✅ How to prepare your dataset for fine-tuning
- ✅ Training with curriculum learning and schedule-free optimization
- ✅ Intelligent audio segmentation for better data quality
- ✅ Evaluation and inference with production-ready scripts
- ✅ Live transcription with Voice Activity Detection
- ✅ ONNX export for 10-30% faster inference
- ✅ Complete deployment pipeline
moonshine-tiny-fr - Fine-tuned French ASR model ready to use!
Fine-tuned using this guide on the Multilingual LibriSpeech French dataset:
- WER: 21.8% on test set
- Model Size: Only 27M parameters
- Inference Speed: RTF 0.11x (9x faster than real-time on CPU)
- Training: 8,000 steps with curriculum learning
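For context, the real-time factor (RTF) is processing time divided by audio duration, so an RTF of 0.11 means roughly 1/0.11 ≈ 9x faster than real time:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# Transcribing 60 s of audio in 6.6 s:
rtf = real_time_factor(6.6, 60.0)
print(f"RTF={rtf:.2f}, {1.0 / rtf:.1f}x real time")  # RTF=0.11, 9.1x real time
```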
Try it now:
```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="Cornebidouil/moonshine-tiny-fr")
result = transcriber("french_audio.wav")
print(result['text'])
```

➡️ View Model Card on HuggingFace
```bash
# Install dependencies
pip install transformers torch torchaudio
```

```python
# Use the model
>>> from transformers import pipeline
>>> transcriber = pipeline("automatic-speech-recognition", model="Cornebidouil/moonshine-tiny-fr")
>>> result = transcriber("your_french_audio.wav")
>>> print(result['text'])
```

```bash
# Clone the repository
git clone https://github.com/pierre-cheneau/finetune-moonshine-asr.git
cd finetune-moonshine-asr

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Optional: Live transcription support
pip install -r requirements-live.txt
```

```bash
# 1. Prepare your dataset (HuggingFace dataset format)
python scripts/intelligent_segmentation.py \
    --dataset facebook/multilingual_librispeech \
    --language french \
    --output ./data/mls_french_segmented

# 2. Train the model
python train.py --config configs/mls_french_no_curriculum.yaml

# 3. Evaluate on test set
python scripts/evaluate.py \
    --model results-moonshine-fr/checkpoint-6000 \
    --dataset ./data/test \
    --split test

# 4. Run inference
python scripts/inference.py \
    --model results-moonshine-fr/checkpoint-6000 \
    --audio sample.wav
```

This guide was used to create moonshine-tiny-fr, a production-ready French ASR model.
| Metric | Value |
|---|---|
| Word Error Rate (WER) | 21.8% |
| Character Error Rate (CER) | ~10% |
| Inference Speed (CPU) | 9x faster than real-time |
| Model Size | 27M parameters |
| Training Time | ~24 hours on single GPU |
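WER in the table above is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal, dependency-free illustration (production evaluation typically uses a library such as jiwer):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("le chat est noir", "le chat et noir"))  # 0.25 (1 error / 4 words)
```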
Basic Transcription:

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="Cornebidouil/moonshine-tiny-fr")
result = transcriber("french_audio.wav")
print(result['text'])
```

Batch Processing:

```python
from pathlib import Path

audio_files = Path("./audio").glob("*.wav")
for audio in audio_files:
    result = transcriber(str(audio))
    print(f"{audio.name}: {result['text']}")
```

Live Transcription:

```bash
# Clone this repo and use inference.py
python scripts/inference.py --model Cornebidouil/moonshine-tiny-fr --live
```

- Installation Guide - Complete setup instructions
- Training Guide - Step-by-step training tutorial
- Dataset Preparation - Prepare your audio data
- Inference Guide - Single file, batch, and live inference
- Live Transcription - Real-time transcription with VAD
- ONNX Runtime - 10-30% faster inference
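Live transcription (covered in the guide above) gates the model with Voice Activity Detection so it only transcribes speech. As a toy illustration of the idea, an energy-threshold VAD over fixed-size frames (the actual VAD used by the scripts is more sophisticated):

```python
def energy_vad(samples, frame_size=400, threshold=0.01):
    """Flag each frame as speech (True) when its mean energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        flags.append(energy > threshold)
    return flags

silence = [0.0] * 400
speech = [0.5, -0.5] * 200           # alternating samples, mean energy 0.25
print(energy_vad(silence + speech))  # [False, True]
```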
| Script | Purpose |
|---|---|
| `train.py` | Main training script with curriculum learning |
| `scripts/inference.py` | Production inference (batch, live, ONNX) |
| `scripts/evaluate.py` | WER/CER evaluation on test sets |
| `scripts/convert_for_deployment.py` | Complete deployment pipeline |

| Script | Purpose |
|---|---|
| `scripts/intelligent_segmentation.py` | Segment long audio with forced alignment |
| `scripts/extract_samples.py` | Extract test samples from datasets |
| `scripts/checkpoint_to_dataset.py` | Create datasets from training checkpoints |
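`scripts/intelligent_segmentation.py` (listed above) splits long recordings using word-level timestamps from forced alignment. A simplified, hypothetical sketch of the core grouping step, assuming `(word, start, end)` tuples as input:

```python
def group_words(aligned_words, max_duration=10.0):
    """Greedily pack (word, start, end) tuples into segments no longer than max_duration."""
    segments, current = [], []
    for word, start, end in aligned_words:
        # Flush the current segment if adding this word would exceed the cap
        if current and end - current[0][1] > max_duration:
            segments.append(current)
            current = []
        current.append((word, start, end))
    if current:
        segments.append(current)
    return segments

words = [("bonjour", 0.0, 0.6), ("tout", 6.0, 6.3), ("le", 9.8, 10.1), ("monde", 10.2, 10.9)]
print([len(s) for s in group_words(words)])  # [2, 2]
```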
```
finetune-moonshine-asr/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── requirements-live.txt     # Optional live mode dependencies
├── train.py                  # Main training script
│
├── docs/                     # Documentation
│   ├── INSTALLATION.md
│   ├── TRAINING_GUIDE.md
│   ├── INFERENCE_GUIDE.md
│   ├── LIVE_MODE_GUIDE.md
│   └── ONNX_MODE_GUIDE.md
│
├── scripts/                  # Utility scripts
│   ├── inference.py
│   ├── evaluate.py
│   ├── convert_for_deployment.py
│   ├── intelligent_segmentation.py
│   └── extract_samples.py
│
├── configs/                  # Training configurations
│   ├── mls_french_no_curriculum.yaml
│   └── example_curriculum.yaml
│
├── examples/                 # Example notebooks
│   └── fine_tune_moonshine_curriculum.ipynb
│
└── moonshine_ft/             # Fine-tuning library
    ├── __init__.py
    ├── data_loader.py
    ├── trainer.py
    └── configs/
```
```bash
pip install -r requirements.txt
```

```bash
# Option A: Use intelligent segmentation (recommended)
python scripts/intelligent_segmentation.py \
    --dataset facebook/multilingual_librispeech \
    --language french \
    --output ./data/mls_french_segmented \
    --max-duration 10.0 \
    --min-duration 1.0

# Option B: Use a pre-segmented dataset
# Just specify the dataset in your config file
```

Create or edit `configs/my_french_model.yaml`:
```yaml
# Dataset configuration
dataset:
  name: "facebook/multilingual_librispeech"
  language: "french"
  train_split: "train"
  test_split: "test"

# Training configuration
training:
  output_dir: "./results-moonshine-fr"
  num_train_epochs: 3
  per_device_train_batch_size: 16
  learning_rate: 5e-5
  warmup_steps: 500

# Model configuration
model:
  name: "UsefulSensors/moonshine-tiny"

# Optimizer
optimizer:
  type: "schedulefree_adamw"
  betas: [0.9, 0.999]
  weight_decay: 0.01
```

```bash
python train.py --config configs/my_french_model.yaml
```

Monitor with TensorBoard:

```bash
tensorboard --logdir results-moonshine-fr/runs
```

```bash
python scripts/evaluate.py \
    --model results-moonshine-fr/checkpoint-best \
    --dataset facebook/multilingual_librispeech \
    --language french \
    --split test
```

```bash
# Single file
python scripts/inference.py \
    --model results-moonshine-fr/checkpoint-best \
    --audio my_audio.wav

# Live transcription
python scripts/inference.py \
    --model results-moonshine-fr/checkpoint-best \
    --live

# ONNX (faster)
python scripts/convert_for_deployment.py \
    --model results-moonshine-fr/checkpoint-best \
    --output moonshine-fr-onnx

python scripts/inference.py \
    --model moonshine-fr-onnx/onnx \
    --audio my_audio.wav \
    --use-manual-onnx
```

Train with progressive difficulty for better convergence:
```yaml
curriculum:
  enabled: true
  stages:
    - duration: 2000  # steps
      max_audio_length: 5.0
      description: "Short audio clips"
    - duration: 3000
      max_audio_length: 10.0
      description: "Medium audio clips"
    - duration: 3000
      max_audio_length: 20.0
      description: "Full-length audio"
```

Use Whisper V3 + forced alignment for optimal segmentation:
```bash
python scripts/intelligent_segmentation.py \
    --dataset your/dataset \
    --language french \
    --output ./data/segmented \
    --use-whisper-v3 \
    --alignment-method "forced" \
    --max-duration 10.0
```

Modern optimizer without learning rate schedules:
```yaml
optimizer:
  type: "schedulefree_adamw"
  learning_rate: 5e-5
  betas: [0.9, 0.999]
  weight_decay: 0.01
  warmup_steps: 500
```

- Use curriculum learning for better convergence
- Start with `batch_size=16`; increase if you have more GPU memory
- Use the schedule-free AdamW optimizer (no LR scheduling needed)
- Monitor WER on validation set, save best checkpoint
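The staged curriculum from the advanced features section boils down to mapping the current training step to a stage's `max_audio_length`. A hypothetical helper showing that lookup (the trainer handles this internally):

```python
STAGES = [  # (duration_in_steps, max_audio_length_seconds), as in example_curriculum.yaml
    (2000, 5.0),
    (3000, 10.0),
    (3000, 20.0),
]

def max_audio_length(step: int, stages=STAGES) -> float:
    """Return the audio-length cap for the curriculum stage containing `step`."""
    boundary = 0
    for duration, max_len in stages:
        boundary += duration
        if step < boundary:
            return max_len
    return stages[-1][1]  # past all stages: keep the final cap

print(max_audio_length(0), max_audio_length(2500), max_audio_length(7999))  # 5.0 10.0 20.0
```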
- CPU: Use ONNX manual mode (20-30% faster)
- GPU: Use PyTorch with FP16 (fastest)
- Live: Enable VAD for better segmentation
- Batch: Process multiple files at once for efficiency
- Convert to ONNX for production
- Use merged decoder for KV cache efficiency
- Binary tokenizer for faster loading
- ORT optimization for additional speedup
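One deployment detail worth knowing: the provided scripts cap generation length in proportion to audio duration (`max_new_tokens = audio_duration * 5`, see the truncation note in troubleshooting below). A sketch of that cap:

```python
def max_new_tokens_for(audio_duration_seconds: float, tokens_per_second: int = 5) -> int:
    """Cap decoder output proportionally to audio length to avoid truncated transcriptions."""
    return int(audio_duration_seconds * tokens_per_second)

print(max_new_tokens_for(12.4))  # 62
```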
Q: Out of memory during training

```yaml
# Reduce batch size
per_device_train_batch_size: 8  # instead of 16

# Or enable gradient accumulation
gradient_accumulation_steps: 2
```

Q: Model not converging

```yaml
# Try curriculum learning
# Start with shorter audio clips
# Increase warmup steps
warmup_steps: 1000
```

Q: Transcriptions are truncated

```
# Already fixed in our scripts!
# Uses: max_new_tokens = audio_duration * 5
```

Q: Slow inference on CPU

```bash
# Use ONNX mode
python scripts/inference.py --model model-onnx --audio audio.wav --use-manual-onnx
```

Q: No microphone detected

```bash
# Check available devices
python -c "import sounddevice as sd; print(sd.query_devices())"

# Install sounddevice
pip install sounddevice
```

Contributions are welcome! Areas for improvement:
- Multi-language support examples
- More curriculum learning strategies
- Quantization support (INT8)
- Speaker diarization integration
- Punctuation restoration
- Docker deployment examples
This project is licensed under the MIT License - see the LICENSE file for details.
- Useful Sensors for the Moonshine model
- HuggingFace for the Transformers library
- Schedule-Free Learning for the optimizer
- Multilingual LibriSpeech for training data
If you use this guide or the fine-tuned model in your research, please cite:
```bibtex
@misc{cheneau2026moonshine-finetune,
  author = {Pierre Chéneau (Cornebidouil)},
  title = {Moonshine ASR Fine-Tuning Guide},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/pierre-cheneau/finetune-moonshine-asr}
}

@misc{cheneau2026moonshine-tiny-fr,
  author = {Pierre Chéneau (Cornebidouil)},
  title = {Moonshine-Tiny-FR: Fine-tuned French Speech Recognition},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Cornebidouil/moonshine-tiny-fr}
}

@misc{jeffries2024moonshinespeechrecognitionlive,
  title = {Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author = {Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
  year = {2024},
  eprint = {2410.15608},
  archivePrefix = {arXiv},
  primaryClass = {cs.SD},
  url = {https://arxiv.org/abs/2410.15608}
}
```

- moonshine-tiny-fr - Fine-tuned French model (ready to use!)
- Multilingual LibriSpeech - Used for French training
- Common Voice - Alternative dataset
For questions or issues:
- Website: pcheneau.fr
- Github: @pierre-cheneau
- HuggingFace: @Cornebidouil
- Discord: HogwartsLegacySpellCaster (Hogwarts Legacy Spell Recognition project's discord)
Made with ❤️ for the ASR community