PodcastInsights is a Python-based tool for processing podcast RSS feeds, downloading episodes, and transcribing them for further analysis. It is intended as a foundation for building analytics on top of podcast content.
- Parse podcast RSS feeds to extract episode metadata
- Download podcast episodes from feeds
- Transcribe audio to text using Whisper speech recognition
- Store organized data with proper file management
from podcast_processor import PodcastProcessor
# Initialize the processor
processor = PodcastProcessor()
# Process a podcast feed (with a limit of 5 episodes)
results = processor.process_feed("https://feeds.megaphone.fm/darknetdiaries", max_episodes=5)
# Print results
for result in results:
    print(f"\nTitle: {result['title']}")
    print(f"Audio: {result['audio_path']}")
    print(f"Transcript: {result['transcript_path']}")

- Python 3.12+
- FFmpeg (required for audio processing)
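
Both prerequisites can be checked before installing anything else. The following is a small sketch under the assumption that FFmpeg is expected on `PATH` as `ffmpeg`; `check_prerequisites` is a hypothetical helper, not part of the package.

```python
import shutil
import sys


def check_prerequisites(min_python: tuple[int, int] = (3, 12)) -> list[str]:
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        required = f"{min_python[0]}.{min_python[1]}"
        problems.append(f"Python {required}+ is required, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print(f"Missing prerequisite: {issue}")
    if not issues:
        print("All prerequisites found.")
```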
# Clone the repository
git clone https://github.com/yourusername/podcast-insights.git
cd podcast-insights
# Create the conda environment
conda env create -f environment.yaml
# Activate the environment
conda activate podcast-processor

# Clone the repository
git clone https://github.com/yourusername/podcast-insights.git
cd podcast-insights
# Create a virtual environment
python -m venv venv
# Activate the environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

podcast-insights/
├── podcast_processor/       # Core package
│   ├── __init__.py
│   ├── processor.py         # Main processing functionality
│   ├── feed_parser.py       # RSS feed parsing utilities
│   ├── audio_handler.py     # Audio download and processing
│   └── transcriber.py       # Speech-to-text transcription
├── scripts/                 # Utility scripts
├── data/                    # Default data storage
│   ├── audio/               # Downloaded audio files
│   └── transcripts/         # Generated transcripts
└── tests/                   # Test suite

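
As an illustration of how the `data/` layout above could map an episode to files, the sketch below derives an audio path and a transcript path from an episode title. The helper names `slugify` and `episode_paths`, and the `.mp3` and `.txt` extensions, are assumptions; the package may name files differently.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Turn an episode title into a filesystem-friendly name."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def episode_paths(title: str, output_directory: str = "data") -> dict[str, Path]:
    """Build audio and transcript paths under the output directory."""
    base = Path(output_directory)
    slug = slugify(title)
    return {
        "audio_path": base / "audio" / f"{slug}.mp3",  # assumed extension
        "transcript_path": base / "transcripts" / f"{slug}.txt",  # assumed extension
    }


if __name__ == "__main__":
    paths = episode_paths("Ep. 1: Hello, World!")
    print(paths["audio_path"])
    print(paths["transcript_path"])
```

Keeping audio and transcripts under the same slug makes it easy to find the transcript that belongs to a given download.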
Set custom paths and preferences by editing the config.yaml file:
output_directory: "data"
whisper_model: "base" # Options: tiny, base, small, medium, large
max_episodes: 10 # Default limit for batch processing

- Episode metadata extraction (show notes, timestamps, etc.)
- Basic content analysis (topic identification, keyword extraction)
- Simple web interface for browsing podcasts and transcripts
- Speaker diarization (identifying different speakers in transcripts)
- Topic segmentation (dividing episodes into thematic sections)
- Sentiment analysis of podcast content
- Cross-episode thematic analysis
- Automated summarization and key point extraction
- Content-based recommendation engine
- Transcript search and indexing
- Potential integration with LLM-based RAG systems for intelligent querying
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.