Skip to content

Complete guide and toolkit for fine-tuning Moonshine ASR models for custom languages. Includes French example with 21.8% WER.

License

Notifications You must be signed in to change notification settings

pierre-cheneau/finetune-moonshine-asr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

5 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Moonshine ASR Fine-Tuning Guide

A comprehensive guide and toolkit for fine-tuning the Moonshine ASR model for custom languages and domains.

License: MIT Python 3.8+ HuggingFace Model

๐ŸŽฏ Overview

This repository provides everything you need to fine-tune Moonshine, a lightweight and efficient automatic speech recognition (ASR) model with only 27M parameters, achieving performance comparable to much larger models.

What you'll learn:

  • โœ… How to prepare your dataset for fine-tuning
  • โœ… Training with curriculum learning and schedule-free optimization
  • โœ… Intelligent audio segmentation for better data quality
  • โœ… Evaluation and inference with production-ready scripts
  • โœ… Live transcription with Voice Activity Detection
  • โœ… ONNX export for 10-30% faster inference
  • โœ… Complete deployment pipeline

๐ŸŽค Pre-trained Model Available

moonshine-tiny-fr - Fine-tuned French ASR model ready to use!

Fine-tuned using this guide on the Multilingual LibriSpeech French dataset:

  • WER: 21.8% on test set
  • Model Size: Only 27M parameters
  • Inference Speed: RTF 0.11x (9x faster than real-time on CPU)
  • Training: 8,000 steps with curriculum learning

Try it now:

from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="Cornebidouil/moonshine-tiny-fr")
result = transcriber("french_audio.wav")
print(result['text'])

โžก๏ธ View Model Card on HuggingFace

๐Ÿš€ Quick Start

Option 1: Use Pre-trained French Model (Fastest)

# Install dependencies
pip install transformers torch torchaudio

# Use the model
python
>>> from transformers import pipeline
>>> transcriber = pipeline("automatic-speech-recognition", model="Cornebidouil/moonshine-tiny-fr")
>>> result = transcriber("your_french_audio.wav")
>>> print(result['text'])

๐Ÿ“ฅ Download Model

Option 2: Fine-Tune Your Own Model

# Clone the repository
git clone https://github.com/pierre-cheneau/finetune-moonshine-asr.git
cd finetune-moonshine-asr

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Optional: Live transcription support
pip install -r requirements-live.txt

Fine-Tune Your First Model

# 1. Prepare your dataset (HuggingFace dataset format)
python scripts/intelligent_segmentation.py \
    --dataset facebook/multilingual_librispeech \
    --language french \
    --output ./data/mls_french_segmented

# 2. Train the model
python train.py --config configs/mls_french_no_curriculum.yaml

# 3. Evaluate on test set
python scripts/evaluate.py \
    --model results-moonshine-fr/checkpoint-6000 \
    --dataset ./data/test \
    --split test

# 4. Run inference
python scripts/inference.py \
    --model results-moonshine-fr/checkpoint-6000 \
    --audio sample.wav

๐ŸŒŸ Example: Fine-Tuned French Model

This guide was used to create moonshine-tiny-fr, a production-ready French ASR model.

Model Performance

Metric Value
Word Error Rate (WER) 21.8%
Character Error Rate (CER) ~10%
Inference Speed (CPU) 9x faster than real-time
Model Size 27M parameters
Training Time ~24 hours on single GPU

Using the Model

Basic Transcription:

from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="Cornebidouil/moonshine-tiny-fr")
result = transcriber("french_audio.wav")
print(result['text'])

Batch Processing:

from pathlib import Path

audio_files = Path("./audio").glob("*.wav")
for audio in audio_files:
    result = transcriber(str(audio))
    print(f"{audio.name}: {result['text']}")

Live Transcription:

# Clone this repo and use inference.py
python scripts/inference.py --model Cornebidouil/moonshine-tiny-fr --live

๐Ÿ“– Full Model Documentation

๐Ÿ“š Documentation

Getting Started

Advanced Features

๐Ÿ› ๏ธ Scripts

Core Scripts

Script Purpose
train.py Main training script with curriculum learning
scripts/inference.py Production inference (batch, live, ONNX)
scripts/evaluate.py WER/CER evaluation on test sets
scripts/convert_for_deployment.py Complete deployment pipeline

Utility Scripts

Script Purpose
scripts/intelligent_segmentation.py Segment long audio with forced alignment
scripts/extract_samples.py Extract test samples from datasets
scripts/checkpoint_to_dataset.py Create datasets from training checkpoints

๐Ÿ“ฆ Repository Structure

finetune-moonshine-asr/
โ”œโ”€โ”€ README.md                   # This file
โ”œโ”€โ”€ requirements.txt            # Python dependencies
โ”œโ”€โ”€ requirements-live.txt       # Optional live mode dependencies
โ”œโ”€โ”€ train.py                    # Main training script
โ”‚
โ”œโ”€โ”€ docs/                       # Documentation
โ”‚   โ”œโ”€โ”€ INSTALLATION.md
โ”‚   โ”œโ”€โ”€ TRAINING_GUIDE.md
โ”‚   โ”œโ”€โ”€ INFERENCE_GUIDE.md
โ”‚   โ”œโ”€โ”€ LIVE_MODE_GUIDE.md
โ”‚   โ””โ”€โ”€ ONNX_MODE_GUIDE.md
โ”‚
โ”œโ”€โ”€ scripts/                    # Utility scripts
โ”‚   โ”œโ”€โ”€ inference.py
โ”‚   โ”œโ”€โ”€ evaluate.py
โ”‚   โ”œโ”€โ”€ convert_for_deployment.py
โ”‚   โ”œโ”€โ”€ intelligent_segmentation.py
โ”‚   โ””โ”€โ”€ extract_samples.py
โ”‚
โ”œโ”€โ”€ configs/                    # Training configurations
โ”‚   โ”œโ”€โ”€ mls_french_no_curriculum.yaml
โ”‚   โ””โ”€โ”€ example_curriculum.yaml
โ”‚
โ”œโ”€โ”€ examples/                   # Example notebooks
โ”‚   โ””โ”€โ”€ fine_tune_moonshine_curriculum.ipynb
โ”‚
โ””โ”€โ”€ moonshine_ft/              # Fine-tuning library
    โ”œโ”€โ”€ __init__.py
    โ”œโ”€โ”€ data_loader.py
    โ”œโ”€โ”€ trainer.py
    โ””โ”€โ”€ configs/

๐ŸŽ“ Tutorial: Fine-Tune for French

Step 1: Install Dependencies

pip install -r requirements.txt

Step 2: Prepare Dataset

# Option A: Use intelligent segmentation (recommended)
python scripts/intelligent_segmentation.py \
    --dataset facebook/multilingual_librispeech \
    --language french \
    --output ./data/mls_french_segmented \
    --max-duration 10.0 \
    --min-duration 1.0

# Option B: Use pre-segmented dataset
# Just specify the dataset in your config file

Step 3: Configure Training

Create or edit configs/my_french_model.yaml:

# Dataset configuration
dataset:
  name: "facebook/multilingual_librispeech"
  language: "french"
  train_split: "train"
  test_split: "test"

# Training configuration
training:
  output_dir: "./results-moonshine-fr"
  num_train_epochs: 3
  per_device_train_batch_size: 16
  learning_rate: 5e-5
  warmup_steps: 500

# Model configuration
model:
  name: "UsefulSensors/moonshine-tiny"

# Optimizer
optimizer:
  type: "schedulefree_adamw"
  betas: [0.9, 0.999]
  weight_decay: 0.01

Step 4: Train

python train.py --config configs/my_french_model.yaml

Monitor with TensorBoard:

tensorboard --logdir results-moonshine-fr/runs

Step 5: Evaluate

python scripts/evaluate.py \
    --model results-moonshine-fr/checkpoint-best \
    --dataset facebook/multilingual_librispeech \
    --language french \
    --split test

Step 6: Inference

# Single file
python scripts/inference.py \
    --model results-moonshine-fr/checkpoint-best \
    --audio my_audio.wav

# Live transcription
python scripts/inference.py \
    --model results-moonshine-fr/checkpoint-best \
    --live

# ONNX (faster)
python scripts/convert_for_deployment.py \
    --model results-moonshine-fr/checkpoint-best \
    --output moonshine-fr-onnx

python scripts/inference.py \
    --model moonshine-fr-onnx/onnx \
    --audio my_audio.wav \
    --use-manual-onnx

๐Ÿ”ฌ Advanced Features

Curriculum Learning

Train with progressive difficulty for better convergence:

curriculum:
  enabled: true
  stages:
    - duration: 2000  # steps
      max_audio_length: 5.0
      description: "Short audio clips"

    - duration: 3000
      max_audio_length: 10.0
      description: "Medium audio clips"

    - duration: 3000
      max_audio_length: 20.0
      description: "Full-length audio"

Intelligent Audio Segmentation

Use Whisper V3 + forced alignment for optimal segmentation:

python scripts/intelligent_segmentation.py \
    --dataset your/dataset \
    --language french \
    --output ./data/segmented \
    --use-whisper-v3 \
    --alignment-method "forced" \
    --max-duration 10.0

Schedule-Free Optimization

Modern optimizer without learning rate schedules:

optimizer:
  type: "schedulefree_adamw"
  learning_rate: 5e-5
  betas: [0.9, 0.999]
  weight_decay: 0.01
  warmup_steps: 500

๐Ÿ“ˆ Performance Tips

Training

  • Use curriculum learning for better convergence
  • Start with batch_size=16, increase if you have more GPU memory
  • Use schedule-free AdamW optimizer (no LR scheduling needed)
  • Monitor WER on validation set, save best checkpoint

Inference

  • CPU: Use ONNX manual mode (20-30% faster)
  • GPU: Use PyTorch with FP16 (fastest)
  • Live: Enable VAD for better segmentation
  • Batch: Process multiple files at once for efficiency

Deployment

  • Convert to ONNX for production
  • Use merged decoder for KV cache efficiency
  • Binary tokenizer for faster loading
  • ORT optimization for additional speedup

๐Ÿ› Troubleshooting

Training Issues

Q: Out of memory during training

# Reduce batch size
per_device_train_batch_size: 8  # instead of 16

# Or enable gradient accumulation
gradient_accumulation_steps: 2

Q: Model not converging

# Try curriculum learning
# Start with shorter audio clips
# Increase warmup steps
warmup_steps: 1000

Inference Issues

Q: Transcriptions are truncated

# Already fixed in our scripts!
# Uses: max_new_tokens = audio_duration * 5

Q: Slow inference on CPU

# Use ONNX mode
python scripts/inference.py --model model-onnx --audio audio.wav --use-manual-onnx

Live Mode Issues

Q: No microphone detected

# Check available devices
python -c "import sounddevice as sd; print(sd.query_devices())"

# Install sounddevice
pip install sounddevice

๐Ÿค Contributing

Contributions are welcome! Areas for improvement:

  • Multi-language support examples
  • More curriculum learning strategies
  • Quantization support (INT8)
  • Speaker diarization integration
  • Punctuation restoration
  • Docker deployment examples

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

๐Ÿ“š Citation

If you use this guide or the fine-tuned model in your research, please cite:

This Fine-Tuning Guide

@misc{cheneau2026moonshine-finetune,
  author = {Pierre Chรฉneau (Cornebidouil)},
  title = {Moonshine ASR Fine-Tuning Guide},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/pierre-cheneau/finetune-moonshine-asr}
}

Fine-Tuned French Model

@misc{cheneau2026moonshine-tiny-fr,
  author = {Pierre Chรฉneau (Cornebidouil)},
  title = {Moonshine-Tiny-FR: Fine-tuned French Speech Recognition},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Cornebidouil/moonshine-tiny-fr}
}

Original Moonshine

@misc{jeffries2024moonshinespeechrecognitionlive,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
  year={2024},
  eprint={2410.15608},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2410.15608}
}

๐Ÿ”— Related Resources

Our Models

Original Moonshine

Datasets

Contact

For questions or issues:


Made with โค๏ธ for the ASR community

About

Complete guide and toolkit for fine-tuning Moonshine ASR models for custom languages. Includes French example with 21.8% WER.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages