
Songyi Course Content Scraper & Transcriber

A Python-based automated system for scraping, downloading, and transcribing online course content from the Bandu API. The system converts course materials into Hugo-compatible markdown files with audio/video transcriptions.

Features

  • Course Data Management: Fetches and stores course metadata in SQLite/PostgreSQL databases
  • Multi-threaded Downloads: Efficiently downloads course materials (audio, video, images, text) using aria2c
  • Audio Processing: Automatically combines multiple audio segments into single MP3 files using FFmpeg
  • Speech-to-Text: Transcribes audio/video content using FunASR/SenseVoice models
  • Hugo Integration: Generates markdown files with proper frontmatter for Hugo static sites
  • Smart Caching: Stores transcriptions in database to avoid redundant processing

Prerequisites

System Dependencies

  • Python 3.12+
  • FFmpeg
  • aria2c

Python Dependencies

See requirements.txt for the full list. Key packages include:

  • requests
  • gradio_client
  • funasr
  • librosa
  • moviepy
  • pymongo
  • psycopg2-binary

Installation

  1. Clone the repository:
git clone <repository-url>
cd songyi
  2. Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install Python dependencies:
pip install -r requirements.txt
  4. Install system dependencies:
# Ubuntu/Debian
sudo apt-get install ffmpeg aria2

# macOS
brew install ffmpeg aria2
  5. Create the configuration file:
cp config.ini.example config.ini
# Edit config.ini with your settings

Configuration

Create a config.ini file with the following structure:

[DEFAULT]
authorization_token = your_bearer_token_here
limit = 100
offset = 0
sort = newest-first
max_download_threads = 5
max_retry_attempts = 3
download_id = 1

[POSTGRES]
dbname = your_db_name
user = your_username
password = your_password
host = localhost
port = 5432
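
As a rough illustration of how these settings could be read, here is a minimal sketch using the standard-library configparser. The README does not say how the project actually loads config.ini, so load_settings is a hypothetical helper, not the project's own code. Interpolation is disabled so that a token containing "%" does not raise an error.

```python
import configparser


def load_settings(text: str) -> dict:
    """Parse config.ini-style text into typed settings (hypothetical helper)."""
    # interpolation=None keeps "%" characters in tokens or passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    defaults = parser["DEFAULT"]
    settings = {
        "authorization_token": defaults.get("authorization_token"),
        "limit": defaults.getint("limit"),
        "offset": defaults.getint("offset"),
        "sort": defaults.get("sort"),
        "max_download_threads": defaults.getint("max_download_threads"),
        "max_retry_attempts": defaults.getint("max_retry_attempts"),
    }

    # A section proxy also exposes DEFAULT keys, so pick the
    # PostgreSQL keys explicitly instead of copying the whole section.
    if parser.has_section("POSTGRES"):
        section = parser["POSTGRES"]
        settings["postgres"] = {
            key: section[key]
            for key in ("dbname", "user", "password", "host", "port")
        }
    return settings
```

To load the real file, pass its contents, for example load_settings(open("config.ini", encoding="utf-8").read()).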

Usage

Run Complete Pipeline

Execute the entire workflow (fetch courses, download content, generate markdown):

python main.py

Individual Components

Fetch course list only:

python course_list_info_parser.py

Download course content only:

python course_content_parser.py

Generate markdown files only:

python markdown_transcribe_hugo.py
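
The three component scripts above can also be chained by hand. The following is a minimal sketch that runs them in order and stops at the first failure. It is not a description of what main.py does internally (the README does not show that); run_steps and its injectable run parameter are hypothetical names used for illustration.

```python
import subprocess
import sys

# Script names are taken from the commands listed above.
STEPS = (
    "course_list_info_parser.py",
    "course_content_parser.py",
    "markdown_transcribe_hugo.py",
)


def run_steps(steps=STEPS, run=subprocess.run) -> list:
    """Run each script with the current interpreter, in order.

    With the default runner, check=True raises CalledProcessError on a
    non-zero exit code, so later steps are skipped after a failure.
    """
    completed = []
    for script in steps:
        run([sys.executable, script], check=True)
        completed.append(script)
    return completed
```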

Project Structure

songyi/
├── main.py                          # Main orchestration script
├── course_list_info_parser.py       # Fetches course metadata
├── course_content_parser.py         # Downloads course materials
├── markdown_transcribe_hugo.py      # Generates Hugo markdown
├── transcribe_media.py              # Audio/video transcription
├── headers.py                       # HTTP headers configuration
├── logging_config.py                # Logging setup
├── config.ini                       # Configuration file (not in repo)
├── courses.db                       # SQLite database
├── content/                         # Generated Hugo markdown files
├── course/                          # Downloaded course materials
│   └── {course_id}/
│       ├── mp3/                     # Audio files
│       ├── mp4/                     # Video files
│       └── ...
└── json/                            # API response cache
    └── {course_id}.json

Workflow

  1. Fetch Courses: Retrieves course list from API and stores in database
  2. Download Content: Downloads all course materials (audio, video, images, text)
  3. Process Audio: Combines audio segments and transcribes them
  4. Generate Markdown: Creates Hugo-compatible markdown files with:
    • Frontmatter (date, title)
    • Text content
    • Images with URLs
    • Audio transcriptions

Database Schema

courses

  • id (INTEGER PRIMARY KEY)
  • title (TEXT)
  • description (TEXT)

contents

  • id (INTEGER PRIMARY KEY)
  • course_id (INTEGER)
  • content (TEXT)
  • category (TEXT)
  • audio_order (INTEGER)
  • attachment_url (TEXT)
  • mime_type (TEXT)

audio_transcriptions

  • id (INTEGER PRIMARY KEY AUTOINCREMENT)
  • course_id (INTEGER)
  • filename (TEXT)
  • text (TEXT)
  • UNIQUE(course_id, filename)
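
For reference, the three tables listed above can be expressed as SQLite DDL. This is a sketch built only from the columns listed in this section; the project's real schema may include constraints or indexes not documented here, and init_db is a hypothetical helper name.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS courses (
    id INTEGER PRIMARY KEY,
    title TEXT,
    description TEXT
);
CREATE TABLE IF NOT EXISTS contents (
    id INTEGER PRIMARY KEY,
    course_id INTEGER,
    content TEXT,
    category TEXT,
    audio_order INTEGER,
    attachment_url TEXT,
    mime_type TEXT
);
CREATE TABLE IF NOT EXISTS audio_transcriptions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    course_id INTEGER,
    filename TEXT,
    text TEXT,
    UNIQUE(course_id, filename)
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```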

Features in Detail

Multi-threaded Downloads

Uses thread pools to download multiple files concurrently with configurable retry logic.
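
A minimal sketch of this pattern, assuming a ThreadPoolExecutor and a caller-supplied fetch function (for example, one that invokes aria2c for a single URL). The names download_with_retry and download_all are hypothetical; the defaults mirror max_download_threads and max_retry_attempts from the configuration example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def download_with_retry(fetch: Callable[[str], None], url: str,
                        max_attempts: int = 3) -> bool:
    """Call fetch(url) up to max_attempts times; True on first success."""
    for _ in range(max_attempts):
        try:
            fetch(url)
            return True
        except Exception:
            # Any failure counts as one used attempt; try again if any remain.
            continue
    return False


def download_all(fetch: Callable[[str], None], urls: Iterable[str],
                 max_threads: int = 5, max_attempts: int = 3) -> Dict[str, bool]:
    """Download URLs concurrently and report success per URL."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = pool.map(
            lambda u: download_with_retry(fetch, u, max_attempts), url_list
        )
        return dict(zip(url_list, results))
```

Injecting fetch keeps the retry and concurrency logic independent of the actual download tool.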

Audio Merging

Automatically detects multiple audio segments and merges them in order using FFmpeg.
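
One way to do an ordered merge is FFmpeg's concat demuxer, sketched below. The README does not state which FFmpeg method the project uses, so this is an assumption, and the helper names are hypothetical. Stream copy (-c copy) only works when all segments share the same codec and parameters; otherwise re-encoding is needed.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def concat_list_text(segments: List[Tuple[int, str]]) -> str:
    """Build the concat list: one "file '...'" line per segment, by order."""
    ordered = [path for _, path in sorted(segments)]
    # Inside single quotes, the concat format escapes a quote as '\''.
    return "".join(
        "file '{}'\n".format(p.replace("'", "'\\''")) for p in ordered
    )


def build_merge_command(list_path: Path, output_path: Path) -> List[str]:
    """FFmpeg invocation for the concat demuxer with stream copy."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_path), "-c", "copy", str(output_path)]


def merge_segments(segments: List[Tuple[int, str]],
                   list_path: Path, output_path: Path) -> None:
    """Write the list file and run FFmpeg (requires ffmpeg on PATH)."""
    list_path.write_text(concat_list_text(segments), encoding="utf-8")
    subprocess.run(build_merge_command(list_path, output_path), check=True)
```

Each tuple pairs an ordering key (such as audio_order from the contents table) with a segment path. Relative paths in the list file are resolved relative to the list file's location.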

Transcription Caching

Stores transcription results in the database to avoid re-processing the same audio files.
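
The caching idea can be sketched as a lookup against the audio_transcriptions table before calling the model. This is an illustrative sketch using SQLite; get_or_transcribe, open_cache, and the transcribe callable are hypothetical names, and the real project may structure this differently.

```python
import sqlite3
from typing import Callable


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and ensure the transcription table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audio_transcriptions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               course_id INTEGER,
               filename TEXT,
               text TEXT,
               UNIQUE(course_id, filename)
           )"""
    )
    return conn


def get_or_transcribe(conn: sqlite3.Connection, course_id: int, filename: str,
                      transcribe: Callable[[str], str]) -> str:
    """Return the cached text if present; otherwise transcribe and store it."""
    row = conn.execute(
        "SELECT text FROM audio_transcriptions "
        "WHERE course_id = ? AND filename = ?",
        (course_id, filename),
    ).fetchone()
    if row is not None:
        return row[0]

    text = transcribe(filename)
    # The UNIQUE(course_id, filename) constraint makes this an upsert.
    conn.execute(
        "INSERT OR REPLACE INTO audio_transcriptions "
        "(course_id, filename, text) VALUES (?, ?, ?)",
        (course_id, filename, text),
    )
    conn.commit()
    return text
```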

Hugo Output Format

Generates markdown files with proper Hugo frontmatter:

+++
date = '2025-10-08'
draft = false
title = 'Course Title'
+++

Course content here...
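
A page in this format could be rendered with a small helper like the one below. This is a sketch, not the project's generator: render_page and toml_string are hypothetical names. TOML literal strings (single quotes) cannot contain a single quote, so such titles fall back to an escaped double-quoted string.

```python
from datetime import date


def toml_string(value: str) -> str:
    """Quote a value for TOML, preferring single-quoted literal strings."""
    if "'" not in value and "\n" not in value:
        return f"'{value}'"
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def render_page(title: str, published: date, body: str,
                draft: bool = False) -> str:
    """Return TOML front matter (+++ delimiters) followed by the body."""
    front_matter = [
        "+++",
        f"date = '{published.isoformat()}'",
        f"draft = {'true' if draft else 'false'}",
        f"title = {toml_string(title)}",
        "+++",
    ]
    return "\n".join(front_matter) + "\n\n" + body.rstrip() + "\n"
```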

Error Handling

  • Automatic retry for failed downloads (configurable)
  • Skips existing files to avoid redundant downloads
  • Logs all operations for debugging
  • Graceful handling of missing or corrupted files

Logging

Logs are configured through logging_config.py. Check console output for progress and error messages.

Contributing

This is a personal project for archiving online course content. Feel free to fork and adapt for your own needs.

License

[Add your license here]

Notes

  • Ensure you have proper authorization to download and process the course content
  • The system is designed around the Bandu API structure; other sources will require modifications
  • Transcription quality depends on the FunASR/SenseVoice model configuration
  • Large courses may require significant disk space and processing time