Latest commit: YuanHui e823388753, "fix transcribe error" (cherry picked from commit 255cc192690654535c4ebeecec1ef6500943f42e), 2025-12-23 10:33:47 +08:00

| Name | Last commit message | Last commit date |
| --- | --- | --- |
| json | update file and add README | 2025-12-02 00:19:00 +08:00 |
| sample | change download folder | 2025-02-16 10:23:12 +08:00 |
| SenseVoice @ 4462e356e2 | update git ignore | 2025-11-21 19:59:24 +08:00 |
| .gitignore | update file and add README | 2025-12-02 00:19:00 +08:00 |
| .gitmodules | add sensevoice submodule | 2025-06-14 09:45:48 +08:00 |
| app.log | add json files | 2025-07-03 11:33:50 +08:00 |
| audio_transcription.log | format code | 2025-04-16 09:38:48 +08:00 |
| config.ini | update git ignore | 2025-11-21 19:59:24 +08:00 |
| course_content_parser.py | update git ignore | 2025-11-21 19:59:24 +08:00 |
| course_list_info_parser.py | add hugo markdown | 2025-07-03 11:32:25 +08:00 |
| courses.db | update file and add README | 2025-12-02 00:19:00 +08:00 |
| fileconvert.py | format code | 2025-04-16 09:38:48 +08:00 |
| headers.py | first commit | 2024-11-22 20:33:57 +08:00 |
| logging_config.py | update token | 2025-04-22 18:40:10 +08:00 |
| main.py | update gitignore | 2025-07-11 15:50:10 +08:00 |
| markdown_transcribe_hugo.py | update content | 2025-07-03 23:05:06 +08:00 |
| markdown_transcribe_rag.py | update gitignore | 2025-07-11 15:50:10 +08:00 |
| markdown_transcribe.py | fix log issue | 2025-04-20 08:44:06 +08:00 |
| mongo_manager.py | format code | 2025-04-16 09:38:48 +08:00 |
| pyproject.toml | fix log issue | 2025-04-20 08:44:06 +08:00 |
| README.md | update file and add README | 2025-12-02 00:19:00 +08:00 |
| requirements.txt | fix log issue | 2025-04-20 08:44:06 +08:00 |
| silent_remover.py | format code | 2025-04-16 09:38:48 +08:00 |
| transcribe_media.py | fix transcribe error | 2025-12-23 10:33:47 +08:00 |
| uv.lock | fix log issue | 2025-04-20 08:44:06 +08:00 |
README.md

Songyi Course Content Scraper & Transcriber

A Python-based automated system for scraping, downloading, and transcribing online course content from the Bandu API. The system converts course materials into Hugo-compatible markdown files with audio/video transcriptions.

Features

Course Data Management: Fetches and stores course metadata in SQLite/PostgreSQL databases
Multi-threaded Downloads: Efficiently downloads course materials (audio, video, images, text) using aria2c
Audio Processing: Automatically combines multiple audio segments into single MP3 files using FFmpeg
Speech-to-Text: Transcribes audio/video content using FunASR/SenseVoice models
Hugo Integration: Generates markdown files with proper frontmatter for Hugo static sites
Smart Caching: Stores transcriptions in database to avoid redundant processing

Prerequisites

System Dependencies

Python 3.12+
FFmpeg
aria2c

Python Dependencies

See requirements.txt for the full list. Key packages include:

requests
gradio_client
funasr
librosa
moviepy
pymongo
psycopg2-binary

Installation

Clone the repository:

git clone <repository-url>
cd songyi

Create and activate virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install Python dependencies:

pip install -r requirements.txt

Install system dependencies:

# Ubuntu/Debian
sudo apt-get install ffmpeg aria2

# macOS
brew install ffmpeg aria2

Create configuration file:

cp config.ini.example config.ini
# Edit config.ini with your settings

Configuration

Create a config.ini file with the following structure:

[DEFAULT]
authorization_token = your_bearer_token_here
limit = 100
offset = 0
sort = newest-first
max_download_threads = 5
max_retry_attempts = 3
download_id = 1

[POSTGRES]
dbname = your_db_name
user = your_username
password = your_password
host = localhost
port = 5432

Usage

Run Complete Pipeline

Execute the entire workflow (fetch courses, download content, generate markdown):

python main.py

Individual Components

Fetch course list only:

python course_list_info_parser.py

Download course content only:

python course_content_parser.py

Generate markdown files only:

python markdown_transcribe_hugo.py

Project Structure

songyi/
├── main.py                          # Main orchestration script
├── course_list_info_parser.py       # Fetches course metadata
├── course_content_parser.py         # Downloads course materials
├── markdown_transcribe_hugo.py      # Generates Hugo markdown
├── transcribe_media.py              # Audio/video transcription
├── headers.py                       # HTTP headers configuration
├── logging_config.py                # Logging setup
├── config.ini                       # Configuration file (not in repo)
├── courses.db                       # SQLite database
├── content/                         # Generated Hugo markdown files
├── course/                          # Downloaded course materials
│   └── {course_id}/
│       ├── mp3/                     # Audio files
│       ├── mp4/                     # Video files
│       └── ...
└── json/                            # API response cache
    └── {course_id}.json

Workflow

Fetch Courses: Retrieves course list from API and stores in database
Download Content: Downloads all course materials (audio, video, images, text)
Process Audio: Combines audio segments and transcribes them
Generate Markdown: Creates Hugo-compatible markdown files with:
- Frontmatter (date, title)
- Text content
- Images with URLs
- Audio transcriptions
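
The markdown body described in the last step could be assembled roughly as follows. This is a sketch under assumptions: the row keys come from the database schema in this README, but the category values `"text"` and `"image"`, the function name `render_body`, and the choice to append transcriptions after the other content are hypothetical, not confirmed by the source.

```python
def render_body(rows: list[dict], transcripts: list[str]) -> str:
    """Join text, image links, and transcriptions into one markdown body."""
    blocks: list[str] = []
    for row in rows:
        # Category names are assumed; the real values may differ.
        if row["category"] == "text" and row.get("content"):
            blocks.append(row["content"].strip())
        elif row["category"] == "image" and row.get("attachment_url"):
            # Images are referenced by URL rather than embedded.
            blocks.append(f"![]({row['attachment_url']})")
    # Transcriptions are appended after the text and images.
    blocks.extend(t.strip() for t in transcripts if t.strip())
    return "\n\n".join(blocks) + "\n"
```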

Database Schema

courses

id (INTEGER PRIMARY KEY)
title (TEXT)
description (TEXT)

contents

id (INTEGER PRIMARY KEY)
course_id (INTEGER)
content (TEXT)
category (TEXT)
audio_order (INTEGER)
attachment_url (TEXT)
mime_type (TEXT)

audio_transcriptions

id (INTEGER PRIMARY KEY AUTOINCREMENT)
course_id (INTEGER)
filename (TEXT)
text (TEXT)
UNIQUE(course_id, filename)
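
The column lists above translate into DDL along these lines. Column names and types follow the lists; the use of SQLite, the `contents` table name as written here, the `IF NOT EXISTS` guards, and the absence of foreign keys are assumptions, and the project's actual statements may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS courses (
    id INTEGER PRIMARY KEY,
    title TEXT,
    description TEXT
);
CREATE TABLE IF NOT EXISTS contents (
    id INTEGER PRIMARY KEY,
    course_id INTEGER,
    content TEXT,
    category TEXT,
    audio_order INTEGER,
    attachment_url TEXT,
    mime_type TEXT
);
CREATE TABLE IF NOT EXISTS audio_transcriptions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    course_id INTEGER,
    filename TEXT,
    text TEXT,
    UNIQUE(course_id, filename)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables if they do not exist yet."""
    conn.executescript(SCHEMA)
    conn.commit()
```

Calling `create_schema(sqlite3.connect("courses.db"))` would apply it to the SQLite file named in the project structure.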

Features in Detail

Multi-threaded Downloads

Uses thread pools to download multiple files concurrently with configurable retry logic.
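
As an illustration of the idea, the sketch below runs downloads in a thread pool and retries each one a bounded number of times. The names `with_retries` and `download_all` are hypothetical, and `fetch` stands in for whatever actually transfers a file (the README mentions aria2c); the defaults mirror `max_download_threads` and `max_retry_attempts` from the configuration section.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def with_retries(fetch: Callable[[str], None], url: str, max_attempts: int) -> bool:
    """Call fetch(url) until it succeeds or max_attempts is exhausted."""
    for _ in range(max_attempts):
        try:
            fetch(url)
            return True
        except Exception:
            # Any failure counts as one used attempt; try again if any remain.
            continue
    return False


def download_all(
    urls: list[str],
    fetch: Callable[[str], None],
    max_threads: int = 5,
    max_attempts: int = 3,
) -> dict[str, bool]:
    """Download all URLs concurrently and report success per URL."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so results line up with urls.
        results = pool.map(lambda u: with_retries(fetch, u, max_attempts), urls)
        return dict(zip(urls, results))
```

Passing the transfer function in as a parameter keeps the retry logic testable without network access.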

Audio Merging

Automatically detects multiple audio segments and merges them in order using FFmpeg.
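
One common way to merge ordered segments is FFmpeg's concat demuxer, sketched below. Whether the project uses this approach is not stated, so treat it as an assumption; the helper names are hypothetical, and `-c copy` only works when all segments share the same codec parameters (re-encoding would be needed otherwise).

```python
def build_concat_list(segments: list[tuple[int, str]]) -> str:
    """Return concat-demuxer list text for (audio_order, path) pairs."""
    lines = []
    for _, path in sorted(segments):  # sort by audio_order first
        # Inside single quotes, the concat format writes a quote as '\''.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def build_ffmpeg_command(list_path: str, output_path: str) -> list[str]:
    """Return the argument list that merges the listed files without re-encoding."""
    return [
        "ffmpeg",
        "-f", "concat",   # read the input as a concat list
        "-safe", "0",     # allow absolute or otherwise "unsafe" paths
        "-i", list_path,
        "-c", "copy",     # stream copy; assumes matching codecs
        output_path,
    ]
```

The list text would be written to a file, and the command passed to `subprocess.run(..., check=True)`.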

Transcription Caching

Stores transcription results in the database to avoid re-processing the same audio files.
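
A cache lookup of this kind might look like the sketch below, which relies on the `UNIQUE(course_id, filename)` constraint of the `audio_transcriptions` table. The function names and the injected `transcribe` callable are hypothetical; SQLite is assumed as the backing store.

```python
import sqlite3
from typing import Callable


def ensure_table(conn: sqlite3.Connection) -> None:
    """Create the cache table with the columns listed in the schema section."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audio_transcriptions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               course_id INTEGER,
               filename TEXT,
               text TEXT,
               UNIQUE(course_id, filename)
           )"""
    )


def get_or_transcribe(
    conn: sqlite3.Connection,
    course_id: int,
    filename: str,
    transcribe: Callable[[str], str],
) -> str:
    """Return the cached text, or transcribe once and store the result."""
    row = conn.execute(
        "SELECT text FROM audio_transcriptions WHERE course_id = ? AND filename = ?",
        (course_id, filename),
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: skip the expensive model call
    text = transcribe(filename)
    # OR IGNORE keeps the first stored row if another writer got there first.
    conn.execute(
        "INSERT OR IGNORE INTO audio_transcriptions (course_id, filename, text) "
        "VALUES (?, ?, ?)",
        (course_id, filename, text),
    )
    conn.commit()
    return text
```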

Hugo Output Format

Generates markdown files with proper Hugo frontmatter:

+++
date = '2025-10-08'
draft = false
title = 'Course Title'
+++

Course content here...
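
A file in this format can be produced with a few lines of string handling. The sketch below is illustrative only: `toml_string` and `render_post` are hypothetical names, and the fallback to a double-quoted TOML string for values containing a single quote is an assumption about how such titles should be handled.

```python
def toml_string(value: str) -> str:
    """Quote a value as a TOML string, preferring single quotes as above."""
    if "'" not in value and "\n" not in value:
        return f"'{value}'"  # literal string: no escaping needed
    # Literal strings cannot contain a single quote, so fall back to a
    # basic (double-quoted) string with escapes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def render_post(title: str, date: str, body: str, draft: bool = False) -> str:
    """Return a Hugo page: TOML frontmatter between +++ lines, then the body."""
    lines = [
        "+++",
        f"date = {toml_string(date)}",
        f"draft = {'true' if draft else 'false'}",
        f"title = {toml_string(title)}",
        "+++",
        "",
        body.rstrip("\n"),
        "",
    ]
    return "\n".join(lines)
```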

Error Handling

Automatic retry for failed downloads (configurable)
Skips existing files to avoid redundant downloads
Logs all operations for debugging
Graceful handling of missing or corrupted files
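
The skip-existing behaviour can be pictured with a small check like the one below. It is a sketch, not the project's code: `needs_download` is a hypothetical helper, and treating a zero-byte file as an incomplete download is an assumption.

```python
from pathlib import Path


def needs_download(target: Path) -> bool:
    """Return True when the target is missing or empty."""
    try:
        # An empty file is treated as a failed or partial download.
        return target.stat().st_size == 0
    except FileNotFoundError:
        return True
```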

Logging

Logs are configured through logging_config.py. Check console output for progress and error messages.
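
The contents of logging_config.py are not shown here, so the following is only a guess at what a minimal console setup could look like. The function name `setup_logging`, the logger name, and the format string are all assumptions.

```python
import logging
import sys


def setup_logging(name: str = "songyi", level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped messages to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Add the handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A module would then call `log = setup_logging()` once and use `log.info(...)` or `log.error(...)` for progress and error messages.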

Contributing

This is a personal project for archiving online course content. Feel free to fork and adapt for your own needs.

License

[Add your license here]

Notes

Ensure you have proper authorization to download and process the course content
The system is designed for the Bandu API structure; other sources will require modifications
Transcription quality depends on the FunASR/SenseVoice model configuration
Large courses may require significant disk space and processing time
