# Songyi Course Content Scraper & Transcriber

A Python-based automated system for scraping, downloading, and transcribing online course content from the Bandu API. The system converts course materials into Hugo-compatible markdown files with audio/video transcriptions.

## Features

- **Course Data Management**: Fetches and stores course metadata in SQLite/PostgreSQL databases
- **Multi-threaded Downloads**: Efficiently downloads course materials (audio, video, images, text) using aria2c
- **Audio Processing**: Automatically combines multiple audio segments into single MP3 files using FFmpeg
- **Speech-to-Text**: Transcribes audio/video content using FunASR/SenseVoice models
- **Hugo Integration**: Generates markdown files with proper frontmatter for Hugo static sites
- **Smart Caching**: Stores transcriptions in database to avoid redundant processing

## Prerequisites

### System Dependencies
- Python 3.12+
- FFmpeg
- aria2c

### Python Dependencies
See [requirements.txt](requirements.txt) for the full list. Key packages include:
- requests
- gradio_client
- funasr
- librosa
- moviepy
- pymongo
- psycopg2-binary

## Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd songyi
```

2. Create and activate virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

3. Install Python dependencies:
```bash
pip install -r requirements.txt
```

4. Install system dependencies:
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg aria2

# macOS
brew install ffmpeg aria2
```

5. Create configuration file:
```bash
cp config.ini.example config.ini
# Edit config.ini with your settings
```

## Configuration

Create a `config.ini` file with the following structure:

```ini
[DEFAULT]
authorization_token = your_bearer_token_here
limit = 100
offset = 0
sort = newest-first
max_download_threads = 5
max_retry_attempts = 3
download_id = 1

[POSTGRES]
dbname = your_db_name
user = your_username
password = your_password
host = localhost
port = 5432
```
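
As a rough illustration of how these settings might be read, the sketch below uses the standard-library `configparser`. The function names (`settings_from_parser`, `load_settings`) and the fallback values are assumptions for this example, not the project's actual code; the fallbacks simply mirror the sample values above.

```python
import configparser

# Keys expected in the [POSTGRES] section, taken from the sample above.
POSTGRES_KEYS = ("dbname", "user", "password", "host", "port")


def settings_from_parser(parser):
    """Convert a loaded ConfigParser into a plain dict with typed values."""
    defaults = parser["DEFAULT"]
    settings = {
        "authorization_token": defaults.get("authorization_token", fallback=""),
        "limit": defaults.getint("limit", fallback=100),
        "offset": defaults.getint("offset", fallback=0),
        "sort": defaults.get("sort", fallback="newest-first"),
        "max_download_threads": defaults.getint("max_download_threads", fallback=5),
        "max_retry_attempts": defaults.getint("max_retry_attempts", fallback=3),
        "download_id": defaults.getint("download_id", fallback=1),
    }
    if parser.has_section("POSTGRES"):
        pg = parser["POSTGRES"]
        # Pick keys explicitly: every section also inherits the DEFAULT keys.
        settings["postgres"] = {key: pg.get(key) for key in POSTGRES_KEYS}
        settings["postgres"]["port"] = pg.getint("port", fallback=5432)
    return settings


def load_settings(path="config.ini"):
    """Read config.ini from disk; raise if the file is missing."""
    parser = configparser.ConfigParser()
    if not parser.read(path, encoding="utf-8"):
        raise FileNotFoundError(path)
    return settings_from_parser(parser)
```

Note that `configparser` copies every `[DEFAULT]` key into every other section, which is why the PostgreSQL keys are selected by name rather than by dumping the whole section.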

## Usage

### Run Complete Pipeline

Execute the entire workflow (fetch courses, download content, generate markdown):

```bash
python main.py
```

### Individual Components

**Fetch course list only:**
```bash
python course_list_info_parser.py
```

**Download course content only:**
```bash
python course_content_parser.py
```

**Generate markdown files only:**
```bash
python markdown_transcribe_hugo.py
```

## Project Structure

```
songyi/
├── main.py                          # Main orchestration script
├── course_list_info_parser.py       # Fetches course metadata
├── course_content_parser.py         # Downloads course materials
├── markdown_transcribe_hugo.py      # Generates Hugo markdown
├── transcribe_media.py              # Audio/video transcription
├── headers.py                       # HTTP headers configuration
├── logging_config.py                # Logging setup
├── config.ini                       # Configuration file (not in repo)
├── courses.db                       # SQLite database
├── content/                         # Generated Hugo markdown files
├── course/                          # Downloaded course materials
│   └── {course_id}/
│       ├── mp3/                     # Audio files
│       ├── mp4/                     # Video files
│       └── ...
└── json/                            # API response cache
    └── {course_id}.json
```

## Workflow

1. **Fetch Courses**: Retrieves the course list from the API and stores it in the database
2. **Download Content**: Downloads all course materials (audio, video, images, text)
3. **Process Audio**: Combines audio segments and transcribes them
4. **Generate Markdown**: Creates Hugo-compatible markdown files with:
   - Frontmatter (date, title)
   - Text content
   - Images with URLs
   - Audio transcriptions

## Database Schema

### courses
- `id` (INTEGER PRIMARY KEY)
- `title` (TEXT)
- `description` (TEXT)

### contents
- `id` (INTEGER PRIMARY KEY)
- `course_id` (INTEGER)
- `content` (TEXT)
- `category` (TEXT)
- `audio_order` (INTEGER)
- `attachment_url` (TEXT)
- `mime_type` (TEXT)

### audio_transcriptions
- `id` (INTEGER PRIMARY KEY AUTOINCREMENT)
- `course_id` (INTEGER)
- `filename` (TEXT)
- `text` (TEXT)
- `UNIQUE(course_id, filename)`
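
The schema above can be expressed as SQLite DDL. The following is a minimal sketch that creates the three tables with the listed columns; the helper name `create_schema` is hypothetical, and the real project may add indexes, constraints, or PostgreSQL-specific types not shown here.

```python
import sqlite3

# DDL transcribed from the column lists above (SQLite dialect).
SCHEMA = """
CREATE TABLE IF NOT EXISTS courses (
    id INTEGER PRIMARY KEY,
    title TEXT,
    description TEXT
);
CREATE TABLE IF NOT EXISTS contents (
    id INTEGER PRIMARY KEY,
    course_id INTEGER,
    content TEXT,
    category TEXT,
    audio_order INTEGER,
    attachment_url TEXT,
    mime_type TEXT
);
CREATE TABLE IF NOT EXISTS audio_transcriptions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    course_id INTEGER,
    filename TEXT,
    text TEXT,
    UNIQUE(course_id, filename)
);
"""


def create_schema(conn):
    """Create all tables if they do not exist yet (safe to call repeatedly)."""
    conn.executescript(SCHEMA)
    conn.commit()


if __name__ == "__main__":
    # Example against an in-memory database; the README names courses.db on disk.
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
```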

## Features in Detail

### Multi-threaded Downloads
Uses thread pools to download multiple files concurrently with configurable retry logic.
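
A minimal sketch of this pattern is shown below, assuming a pluggable `fetch(url, dest)` callable that does the actual transfer (for example, a wrapper around aria2c). The names `download_with_retry` and `download_all` are hypothetical, and the defaults mirror `max_download_threads` and `max_retry_attempts` from the sample configuration; the project's own implementation may differ.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_with_retry(fetch, url, dest, max_attempts=3, delay=0.0):
    """Call fetch(url, dest) up to max_attempts times; skip files that already exist."""
    if os.path.exists(dest):
        return "skipped"
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            fetch(url, dest)
            return "downloaded"
        except Exception as exc:  # any failure counts as a retryable error here
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error


def download_all(fetch, jobs, max_threads=5, max_attempts=3):
    """Run (url, dest) jobs concurrently and return a {url: status} mapping."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {
            pool.submit(download_with_retry, fetch, url, dest, max_attempts): url
            for url, dest in jobs
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except RuntimeError:
                results[url] = "failed"
    return results
```

Passing the transfer function in as an argument keeps the retry and concurrency logic independent of the download tool, which also makes it easy to exercise with a stub.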

### Audio Merging
Automatically detects multiple audio segments and merges them in order using FFmpeg.
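
One common way to do this is FFmpeg's concat demuxer, sketched below. This is an assumption about the approach rather than a description of the project's code: the helper names are hypothetical, segments are assumed to arrive as `(audio_order, path)` pairs, and stream copy (`-c copy`) only works cleanly when all segments share the same codec and parameters.

```python
import subprocess


def concat_list(segments):
    """Build the concat list text from (audio_order, path) pairs, sorted by order.

    Paths containing single quotes would need escaping; this sketch does not handle them.
    """
    ordered = sorted(segments, key=lambda segment: segment[0])
    return "".join(f"file '{path}'\n" for _, path in ordered)


def ffmpeg_concat_command(list_path, output_path):
    """Return the FFmpeg argument list that joins the files named in list_path."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-f", "concat",        # use the concat demuxer
        "-safe", "0",          # allow absolute paths in the list file
        "-i", list_path,
        "-c", "copy",          # copy streams without re-encoding
        output_path,
    ]


def merge_segments(segments, list_path, output_path):
    """Write the list file and run FFmpeg; raises CalledProcessError on failure."""
    with open(list_path, "w", encoding="utf-8") as handle:
        handle.write(concat_list(segments))
    subprocess.run(ffmpeg_concat_command(list_path, output_path), check=True)
```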

### Transcription Caching
Stores transcription results in the database to avoid re-processing the same audio files.
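
The lookup-then-store pattern might look like the following sketch, which relies on the `UNIQUE(course_id, filename)` constraint of the `audio_transcriptions` table. The function names and the `transcribe` callable are hypothetical stand-ins; in the real project the callable would wrap the FunASR/SenseVoice model.

```python
import sqlite3


def ensure_cache_table(conn):
    """Create the audio_transcriptions table if it is missing."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS audio_transcriptions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            course_id INTEGER,
            filename TEXT,
            text TEXT,
            UNIQUE(course_id, filename)
        )
        """
    )
    conn.commit()


def get_or_transcribe(conn, course_id, filename, transcribe):
    """Return the cached text for (course_id, filename), transcribing only on a miss."""
    row = conn.execute(
        "SELECT text FROM audio_transcriptions WHERE course_id = ? AND filename = ?",
        (course_id, filename),
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: the model is not invoked
    text = transcribe(filename)
    conn.execute(
        "INSERT OR REPLACE INTO audio_transcriptions (course_id, filename, text) "
        "VALUES (?, ?, ?)",
        (course_id, filename, text),
    )
    conn.commit()
    return text
```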

### Hugo Output Format
Generates markdown files with proper Hugo frontmatter:
```markdown
+++
date = '2025-10-08'
draft = false
title = 'Course Title'
+++

Course content here...
```
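
A page in this format could be rendered with a small helper like the sketch below. The names `toml_string` and `render_page` are hypothetical; the quoting fallback is an addition for titles containing a single quote, which TOML literal strings cannot represent.

```python
from datetime import date


def toml_string(value):
    """Quote a value for TOML: literal string when possible, basic string otherwise."""
    if "'" in value or "\n" in value:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        return f'"{escaped}"'
    return f"'{value}'"


def render_page(title, body, published, draft=False):
    """Return the markdown text for one page: TOML front matter followed by the body."""
    lines = [
        "+++",
        f"date = '{published.isoformat()}'",
        f"draft = {'true' if draft else 'false'}",
        f"title = {toml_string(title)}",
        "+++",
        "",
        body.rstrip("\n"),
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_page("Course Title", "Course content here...", date(2025, 10, 8)))
```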

## Error Handling

- Automatic retry for failed downloads (configurable)
- Skips existing files to avoid redundant downloads
- Logs all operations for debugging
- Graceful handling of missing or corrupted files

## Logging

Logging is configured in [logging_config.py](logging_config.py). Check console output for progress and error messages.

## Contributing

This is a personal project for archiving online course content. Feel free to fork and adapt for your own needs.

## License

[Add your license here]

## Notes

- Ensure you have proper authorization to download and process the course content
- The system is designed for the Bandu API structure; other sources will require modifications
- Transcription quality depends on the FunASR/SenseVoice model configuration
- Large courses may require significant disk space and processing time