Songyi Course Content Scraper & Transcriber
A Python-based automated system for scraping, downloading, and transcribing online course content from the Bandu API. The system converts course materials into Hugo-compatible markdown files with audio/video transcriptions.
Features
- Course Data Management: Fetches and stores course metadata in SQLite/PostgreSQL databases
- Multi-threaded Downloads: Efficiently downloads course materials (audio, video, images, text) using aria2c
- Audio Processing: Automatically combines multiple audio segments into single MP3 files using FFmpeg
- Speech-to-Text: Transcribes audio/video content using FunASR/SenseVoice models
- Hugo Integration: Generates markdown files with proper frontmatter for Hugo static sites
- Smart Caching: Stores transcriptions in database to avoid redundant processing
Prerequisites
System Dependencies
- Python 3.12+
- FFmpeg
- aria2c
Python Dependencies
See requirements.txt for the full list. Key packages include:
- requests
- gradio_client
- funasr
- librosa
- moviepy
- pymongo
- psycopg2-binary
Installation
- Clone the repository:
git clone <repository-url>
cd songyi
- Create and activate virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install Python dependencies:
pip install -r requirements.txt
- Install system dependencies:
# Ubuntu/Debian
sudo apt-get install ffmpeg aria2
# macOS
brew install ffmpeg aria2
- Create configuration file:
cp config.ini.example config.ini
# Edit config.ini with your settings
Configuration
Create a config.ini file with the following structure:
[DEFAULT]
authorization_token = your_bearer_token_here
limit = 100
offset = 0
sort = newest-first
max_download_threads = 5
max_retry_attempts = 3
download_id = 1
[POSTGRES]
dbname = your_db_name
user = your_username
password = your_password
host = localhost
port = 5432
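As a rough illustration of how these settings could be read, the sketch below uses the standard-library configparser. The helper name load_settings and the shape of the returned dictionary are assumptions made for this example, not the project's actual code.

```python
import configparser


def load_settings(path="config.ini"):
    """Read config.ini and return typed settings (hypothetical helper)."""
    # interpolation=None keeps '%' characters in tokens or passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path, encoding="utf-8"):
        raise FileNotFoundError(f"could not read {path}")

    defaults = parser["DEFAULT"]
    pg = parser["POSTGRES"]
    return {
        "authorization_token": defaults["authorization_token"],
        "limit": defaults.getint("limit"),
        "offset": defaults.getint("offset"),
        "sort": defaults["sort"],
        "max_download_threads": defaults.getint("max_download_threads"),
        "max_retry_attempts": defaults.getint("max_retry_attempts"),
        "download_id": defaults.getint("download_id"),
        # Every section inherits the [DEFAULT] keys, so pick the
        # PostgreSQL keys explicitly instead of calling dict(pg).
        "postgres": {
            "dbname": pg["dbname"],
            "user": pg["user"],
            "password": pg["password"],
            "host": pg["host"],
            "port": pg.getint("port"),
        },
    }
```

Note that configparser copies every [DEFAULT] key into each other section, which is why the PostgreSQL keys are selected by name.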
Usage
Run Complete Pipeline
Execute the entire workflow (fetch courses, download content, generate markdown):
python main.py
Individual Components
Fetch course list only:
python course_list_info_parser.py
Download course content only:
python course_content_parser.py
Generate markdown files only:
python markdown_transcribe_hugo.py
Project Structure
songyi/
├── main.py # Main orchestration script
├── course_list_info_parser.py # Fetches course metadata
├── course_content_parser.py # Downloads course materials
├── markdown_transcribe_hugo.py # Generates Hugo markdown
├── transcribe_media.py # Audio/video transcription
├── headers.py # HTTP headers configuration
├── logging_config.py # Logging setup
├── config.ini # Configuration file (not in repo)
├── courses.db # SQLite database
├── content/ # Generated Hugo markdown files
├── course/ # Downloaded course materials
│   └── {course_id}/
│       ├── mp3/ # Audio files
│       ├── mp4/ # Video files
│       └── ...
└── json/ # API response cache
    └── {course_id}.json
Workflow
- Fetch Courses: Retrieves the course list from the API and stores it in the database
- Download Content: Downloads all course materials (audio, video, images, text)
- Process Audio: Combines audio segments and transcribes them
- Generate Markdown: Creates Hugo-compatible markdown files with:
  - Frontmatter (date, title)
  - Text content
  - Images with URLs
  - Audio transcriptions
Database Schema
courses
- id (INTEGER PRIMARY KEY)
- title (TEXT)
- description (TEXT)
contents
- id (INTEGER PRIMARY KEY)
- course_id (INTEGER)
- content (TEXT)
- category (TEXT)
- audio_order (INTEGER)
- attachment_url (TEXT)
- mime_type (TEXT)
audio_transcriptions
- id (INTEGER PRIMARY KEY AUTOINCREMENT)
- course_id (INTEGER)
- filename (TEXT)
- text (TEXT)
- UNIQUE(course_id, filename)
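For reference, the three tables above could be created in SQLite roughly as follows. This is a sketch derived from the column list in this README; the create_schema helper is a hypothetical name, and the project's real DDL may differ (for example in indexes or foreign keys).

```python
import sqlite3

# DDL reconstructed from the column lists above; not copied from the project.
SCHEMA = """
CREATE TABLE IF NOT EXISTS courses (
    id INTEGER PRIMARY KEY,
    title TEXT,
    description TEXT
);
CREATE TABLE IF NOT EXISTS contents (
    id INTEGER PRIMARY KEY,
    course_id INTEGER,
    content TEXT,
    category TEXT,
    audio_order INTEGER,
    attachment_url TEXT,
    mime_type TEXT
);
CREATE TABLE IF NOT EXISTS audio_transcriptions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    course_id INTEGER,
    filename TEXT,
    text TEXT,
    UNIQUE(course_id, filename)
);
"""


def create_schema(conn):
    """Create the three tables if they do not exist yet (hypothetical helper)."""
    conn.executescript(SCHEMA)
    conn.commit()
```

Calling create_schema(sqlite3.connect("courses.db")) would be one way to initialise the database file listed in the project structure.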
Features in Detail
Multi-threaded Downloads
Uses thread pools to download multiple files concurrently with configurable retry logic.
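A minimal sketch of this pattern, assuming a caller-supplied fetch callable that downloads one URL and raises on failure. The function names are hypothetical, and the sketch does not invoke aria2c; it only illustrates the thread pool and retry structure, with parameters named after the max_download_threads and max_retry_attempts settings.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_with_retry(fetch, url, max_retry_attempts=3, delay=0.0):
    """Call fetch(url), retrying up to max_retry_attempts times in total."""
    if max_retry_attempts < 1:
        raise ValueError("max_retry_attempts must be at least 1")
    for attempt in range(1, max_retry_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retry_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)


def download_all(fetch, urls, max_download_threads=5, max_retry_attempts=3):
    """Download urls concurrently; return (results, failures) keyed by URL."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_download_threads) as pool:
        futures = {
            pool.submit(download_with_retry, fetch, url, max_retry_attempts): url
            for url in urls
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                failures[url] = exc
    return results, failures
```

Collecting failures instead of raising on the first one lets a single bad URL fail without aborting the rest of the batch.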
Audio Merging
Automatically detects multiple audio segments and merges them in order using FFmpeg.
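One common way to do this is FFmpeg's concat demuxer, sketched below. Whether the project uses this exact approach is an assumption; the helper names and the segments.txt list file are hypothetical. Stream copy (-c copy) only works cleanly when all segments share the same codec and parameters.

```python
import subprocess
from pathlib import Path


def build_concat_list(segments):
    """Return concat-demuxer list text for (audio_order, path) pairs."""
    ordered = sorted(segments, key=lambda item: item[0])
    lines = []
    for _, path in ordered:
        # Inside a single-quoted entry, a quote is written as '\''.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def build_ffmpeg_command(list_file, output_file):
    """Build the ffmpeg invocation that joins the listed segments."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", str(list_file),
        "-c", "copy",
        str(output_file),
    ]


def merge_segments(segments, output_file, list_file="segments.txt"):
    """Write the list file and run ffmpeg (requires ffmpeg on PATH)."""
    Path(list_file).write_text(build_concat_list(segments), encoding="utf-8")
    subprocess.run(build_ffmpeg_command(list_file, output_file), check=True)
```

Sorting on the first tuple element mirrors the audio_order column of the contents table.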
Transcription Caching
Stores transcription results in the database to avoid re-processing the same audio files.
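The lookup-then-store pattern might look like the sketch below, which relies on the UNIQUE(course_id, filename) constraint of the audio_transcriptions table. The function names are hypothetical, and transcribe stands in for whatever callable wraps the FunASR/SenseVoice model.

```python
import sqlite3


def ensure_cache_table(conn):
    """Create the audio_transcriptions table if it is missing."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS audio_transcriptions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            course_id INTEGER,
            filename TEXT,
            text TEXT,
            UNIQUE(course_id, filename)
        )
        """
    )
    conn.commit()


def get_or_transcribe(conn, course_id, filename, transcribe):
    """Return the cached text, calling transcribe(filename) only on a miss."""
    row = conn.execute(
        "SELECT text FROM audio_transcriptions WHERE course_id = ? AND filename = ?",
        (course_id, filename),
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: skip the expensive model call

    text = transcribe(filename)
    # INSERT OR IGNORE keeps the first stored row if another worker
    # cached the same (course_id, filename) in the meantime.
    conn.execute(
        "INSERT OR IGNORE INTO audio_transcriptions (course_id, filename, text) "
        "VALUES (?, ?, ?)",
        (course_id, filename, text),
    )
    conn.commit()
    return text
```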
Hugo Output Format
Generates markdown files with proper Hugo frontmatter:
+++
date = '2025-10-08'
draft = false
title = 'Course Title'
+++
Course content here...
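A page in this format could be rendered with a small helper like the one below. The names toml_string and render_page are hypothetical, and the fallback to a double-quoted string for titles that contain a single quote is an assumption about how such titles should be handled, since TOML literal strings cannot contain one.

```python
from datetime import date


def toml_string(value):
    """Quote a value for TOML, preferring the single-quoted literal form."""
    if "'" not in value and "\n" not in value:
        return f"'{value}'"
    # Fall back to a basic string and escape what it requires.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def render_page(title, body, published, draft=False):
    """Return the markdown text: TOML frontmatter followed by the body."""
    lines = [
        "+++",
        f"date = '{published.isoformat()}'",
        f"draft = {'true' if draft else 'false'}",
        f"title = {toml_string(title)}",
        "+++",
        body.rstrip("\n"),
        "",  # ensures the file ends with a newline
    ]
    return "\n".join(lines)
```

For example, render_page("Course Title", "Course content here...", date(2025, 10, 8)) reproduces the sample shown above.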
Error Handling
- Automatic retry for failed downloads (configurable)
- Skips existing files to avoid redundant downloads
- Logs all operations for debugging
- Graceful handling of missing or corrupted files
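The skip-existing check can be as simple as the sketch below. Treating a zero-byte file as an incomplete download that should be fetched again is an assumption of this example, and the helper names are hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def needs_download(path):
    """Return True when the target is missing or empty."""
    target = Path(path)
    return not (target.is_file() and target.stat().st_size > 0)


def pending_downloads(items):
    """Filter (url, path) pairs down to those that still need fetching."""
    pending = []
    for url, path in items:
        if needs_download(path):
            pending.append((url, path))
        else:
            logger.info("Skipping existing file: %s", path)
    return pending
```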
Logging
Logs are configured through logging_config.py. Check console output for progress and error messages.
Contributing
This is a personal project for archiving online course content. Feel free to fork and adapt for your own needs.
License
[Add your license here]
Notes
- Ensure you have proper authorization to download and process the course content
- The system is designed around the Bandu API structure; other sources will require modifications
- Transcription quality depends on the FunASR/SenseVoice model configuration
- Large courses may require significant disk space and processing time