Compare commits

...

2 Commits

Author SHA1 Message Date
YuanHui
e823388753 fix transcribe error
(cherry picked from commit 255cc192690654535c4ebeecec1ef6500943f42e)
2025-12-23 10:33:47 +08:00
0ce1ec18c3 update file and add README 2025-12-02 00:19:00 +08:00
5 changed files with 262 additions and 6 deletions

3
.gitignore vendored
View File

@@ -30,7 +30,8 @@
*.tar.bz2
*.tgz
*.md
/markdown/*.md
/content/*.md
# 其他格式的媒体文件
/.venv/

218
README.md Normal file
View File

@@ -0,0 +1,218 @@
# Songyi Course Content Scraper & Transcriber
A Python-based automated system for scraping, downloading, and transcribing online course content from the Bandu API. The system converts course materials into Hugo-compatible markdown files with audio/video transcriptions.
## Features
- **Course Data Management**: Fetches and stores course metadata in SQLite/PostgreSQL databases
- **Multi-threaded Downloads**: Efficiently downloads course materials (audio, video, images, text) using aria2c
- **Audio Processing**: Automatically combines multiple audio segments into single MP3 files using FFmpeg
- **Speech-to-Text**: Transcribes audio/video content using FunASR/SenseVoice models
- **Hugo Integration**: Generates markdown files with proper frontmatter for Hugo static sites
- **Smart Caching**: Stores transcriptions in database to avoid redundant processing
## Prerequisites
### System Dependencies
- Python 3.12+
- FFmpeg
- aria2c
### Python Dependencies
See [requirements.txt](requirements.txt) for the full list. Key packages include:
- requests
- gradio_client
- funasr
- librosa
- moviepy
- pymongo
- psycopg2-binary
## Installation
1. Clone the repository:
```bash
git clone <repository-url>
cd songyi
```
2. Create and activate virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```
3. Install Python dependencies:
```bash
pip install -r requirements.txt
```
4. Install system dependencies:
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg aria2
# macOS
brew install ffmpeg aria2
```
5. Create configuration file:
```bash
cp config.ini.example config.ini
# Edit config.ini with your settings
```
## Configuration
Create a `config.ini` file with the following structure:
```ini
[DEFAULT]
authorization_token = your_bearer_token_here
limit = 100
offset = 0
sort = newest-first
max_download_threads = 5
max_retry_attempts = 3
download_id = 1
[POSTGRES]
dbname = your_db_name
user = your_username
password = your_password
host = localhost
port = 5432
```
## Usage
### Run Complete Pipeline
Execute the entire workflow (fetch courses, download content, generate markdown):
```bash
python main.py
```
### Individual Components
**Fetch course list only:**
```bash
python course_list_info_parser.py
```
**Download course content only:**
```bash
python course_content_parser.py
```
**Generate markdown files only:**
```bash
python markdown_transcribe_hugo.py
```
## Project Structure
```
songyi/
├── main.py # Main orchestration script
├── course_list_info_parser.py # Fetches course metadata
├── course_content_parser.py # Downloads course materials
├── markdown_transcribe_hugo.py # Generates Hugo markdown
├── transcribe_media.py # Audio/video transcription
├── headers.py # HTTP headers configuration
├── logging_config.py # Logging setup
├── config.ini # Configuration file (not in repo)
├── courses.db # SQLite database
├── content/ # Generated Hugo markdown files
├── course/ # Downloaded course materials
│ └── {course_id}/
│ ├── mp3/ # Audio files
│ ├── mp4/ # Video files
│ └── ...
└── json/ # API response cache
└── {course_id}.json
```
## Workflow
1. **Fetch Courses**: Retrieves course list from API and stores in database
2. **Download Content**: Downloads all course materials (audio, video, images, text)
3. **Process Audio**: Combines audio segments and transcribes them
4. **Generate Markdown**: Creates Hugo-compatible markdown files with:
- Frontmatter (date, title)
- Text content
- Images with URLs
- Audio transcriptions
## Database Schema
### courses
- `id` (INTEGER PRIMARY KEY)
- `title` (TEXT)
- `description` (TEXT)
### contents
- `id` (INTEGER PRIMARY KEY)
- `course_id` (INTEGER)
- `content` (TEXT)
- `category` (TEXT)
- `audio_order` (INTEGER)
- `attachment_url` (TEXT)
- `mime_type` (TEXT)
### audio_transcriptions
- `id` (INTEGER PRIMARY KEY AUTOINCREMENT)
- `course_id` (INTEGER)
- `filename` (TEXT)
- `text` (TEXT)
- `UNIQUE(course_id, filename)`
## Features in Detail
### Multi-threaded Downloads
Uses thread pools to download multiple files concurrently with configurable retry logic.
### Audio Merging
Automatically detects multiple audio segments and merges them in order using FFmpeg.
### Transcription Caching
Stores transcription results in the database to avoid re-processing the same audio files.
### Hugo Output Format
Generates markdown files with proper Hugo frontmatter:
```markdown
+++
date = '2025-10-08'
draft = false
title = 'Course Title'
+++
Course content here...
```
## Error Handling
- Automatic retry for failed downloads (configurable)
- Skips existing files to avoid redundant downloads
- Logs all operations for debugging
- Graceful handling of missing or corrupted files
## Logging
Logs are configured through [logging_config.py](logging_config.py). Check console output for progress and error messages.
## Contributing
This is a personal project for archiving online course content. Feel free to fork and adapt for your own needs.
## License
[Add your license here]
## Notes
- Ensure you have proper authorization to download and process the course content
- The system is designed for the Bandu API structure; modifications needed for other sources
- Transcription quality depends on the FunASR/SenseVoice model configuration
- Large courses may require significant disk space and processing time

Binary file not shown.

32
json/745.json Normal file
View File

@@ -0,0 +1,32 @@
{
"ts": 1764605503269,
"data": [
{
"id": 14900,
"course_id": 745,
"content": "7d05ce95-2467-4744-b317-8eac65568b93.m3u8",
"category": "video",
"attachment_id": "7d05ce95-2467-4744-b317-8eac65568b93",
"order": 0,
"duration": 4308320,
"created_at": "2025-11-30T13:01:32.073Z",
"updated_at": "2025-11-30T13:03:19.25Z",
"attachment": {
"id": 102535,
"attachment_id": "7d05ce95-2467-4744-b317-8eac65568b93",
"name": "7d05ce95-2467-4744-b317-8eac65568b93.m3u8",
"thumb": "",
"raw": "https://pili-vod.songy.info/7d05ce95-2467-4744-b317-8eac65568b93.m3u8",
"size": 0,
"duration": 4308320,
"mime_type": "application/x-mpegurl",
"location": "qiniu",
"created_at": "2025-11-30T13:01:32.07Z",
"updated_at": "2025-11-30T13:03:19.246Z",
"url": "https://pili-vod.songy.info/7d05ce95-2467-4744-b317-8eac65568b93.m3u8",
"raw_url": "https://pili-vod.songy.info/7d05ce95-2467-4744-b317-8eac65568b93.m3u8",
"thumb_url": "https://pili-vod.songy.info/7d05ce95-2467-4744-b317-8eac65568b93.m3u8"
}
}
]
}

View File

@@ -1,5 +1,6 @@
import os
import argparse
import uuid
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
@@ -11,8 +12,11 @@ from logging_config import setup_logging
logger = setup_logging()
def extract_or_convert_audio(file_path, output_audio_path="processed_audio.wav"):
def extract_or_convert_audio(file_path, output_audio_path="processed_audio"):
ext = os.path.splitext(file_path)[1].lower()
filename = os.path.basename(file_path)
random_uuid = str(uuid.uuid4())
output_audio_path = output_audio_path + "_" + random_uuid + ".wav"
if ext in [".mp4", ".mov", ".avi", ".mkv"]:
logger.info("🎬 Extracting audio from video...")
@@ -27,11 +31,12 @@ def extract_or_convert_audio(file_path, output_audio_path="processed_audio.wav")
sound.export(output_audio_path, format="wav")
else:
raise ValueError(f"Unsupported file type: {ext}")
logger.info(f"Converted Audio saved to: {output_audio_path}")
return output_audio_path
def transcribe_audio_funasr(audio_path, device="cuda:0"):
def transcribe_audio_funasr(audio_path, device="cpu"):
logger.info("🧠 Loading FunASR model...")
model = AutoModel(
model="iic/SenseVoiceSmall",
@@ -59,7 +64,7 @@ def transcribe_audio_funasr(audio_path, device="cuda:0"):
# 加载模型并作为全局变量
default_model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True, device="cuda:0", disable_update=True)
default_model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True, device="cpu", disable_update=True)
def transcribe_audio_funasr_batch(audio_path):
res = default_model.generate(
@@ -152,8 +157,8 @@ def convert_media(file_path, is_batch=False, save_to_disk=True):
logger.info(f"✅ Transcript saved to: {output_path}")
return transcript
finally:
if os.path.exists("processed_audio.wav"):
os.remove("processed_audio.wav")
if os.path.exists(audio_file):
os.remove(audio_file)
def process_input(path, recursive=False):