Automating Meeting Minutes with Whisper and Python

A guide to setting up a local AI transcription pipeline and converting raw audio into structured Minutes of Meeting (MoM) documents.

Jun 26, 2026

Meetings are essential, but writing Minutes of Meeting (MoM) is often a tedious manual task. When handling large-scale, confidential sharing sessions, sending audio recordings to third-party APIs can be costly and pose privacy risks.

In this lab, we will build a private, local, end-to-end automation pipeline. We will write scripts to segment large audio files, transcribe them using a local instance of OpenAI's Whisper model with GPU acceleration, and generate formatted Word documents (.docx) using Python.

Technical Architecture

Here is the workflow of our automated pipeline:

  [Raw Audio File]
         │
         ▼
  [Segmenter Script] ──(Splits large files via FFmpeg)──> [Audio Chunks]
                                                                 │
                                                                 ▼
  [Raw Text Transcript] <──(GPU-Accelerated Speech-to-Text)── [Local Whisper]
         │
         ▼
  [LLM Summarization] ──(Formats with Custom Prompts)──> [Structured MoM]
                                                                 │
                                                                 ▼
  [Final MoM.docx] <──(python-docx Compiler)── [docx Compiler Engine]

Pipeline Processes

Segmenter Script: Uses FFmpeg to split large files into 5-minute audio chunks to prevent memory overload.
Local Whisper Engine: Runs a local speech-to-text model on the chunks, leveraging GPU acceleration.
LLM Summarization: Uses an LLM to clean, organize, and structure the raw transcript into standard markdown.
docx Compiler Engine: Compiles the final markdown into a formatted Microsoft Word document (.docx).

Phase 1: Environment Setup

Before starting, ensure your system has a Python environment configured with GPU support (if available) and the necessary command-line utilities.

1. Install FFmpeg

Whisper relies on FFmpeg for audio processing.

Windows: Download a build from Gyan.dev and add its bin folder to your system PATH.
macOS: Run brew install ffmpeg
Linux (Debian/Ubuntu): Run sudo apt install ffmpeg
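
Before moving on, it can help to confirm that both binaries are actually reachable, since the segmenter in Phase 2 calls ffmpeg and ffprobe by name. The following is a minimal sketch using only the standard library; the function name missing_tools is a hypothetical helper, not part of any package.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("ffmpeg and ffprobe are both available.")
```

If anything is reported as missing, re-check the PATH step for your platform and open a new terminal before retrying.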

2. Verify GPU Acceleration

To make local transcription fast, we need PyTorch to detect your GPU (CUDA). Open your terminal and run:

python -c "import torch; print('CUDA Available:', torch.cuda.is_available())"

If it prints CUDA Available: False, you can still run transcription on your CPU, but it will take significantly longer.
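
The same check can be folded into a script so the pipeline picks a device on its own. This is a small sketch, assuming you want a silent fallback to the CPU; pick_device is a hypothetical helper name.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet; fall back to the CPU label.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

The returned string can be passed to Whisper as whisper.load_model("base", device=pick_device()). On CPU, calling model.transcribe(..., fp16=False) avoids the half-precision warning Whisper otherwise prints.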

3. Install Dependencies

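Install PyTorch, Whisper, and the Python document utilities. The first command pulls the CUDA 11.8 build of PyTorch (cu118); on a machine without an NVIDIA GPU, or with a different CUDA version, use the PyTorch install command that matches your setup instead. The scripts in this lab call FFmpeg through subprocess, so ffmpeg-python is optional: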

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper python-docx ffmpeg-python

Phase 2: Splitting Large Audio Files

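Cloud transcription services impose upload limits (OpenAI's hosted Whisper API, for example, caps files at 25 MB). Running locally removes that limit, but transcribing one massive continuous file (60MB+ / 1 hour+) can still cause memory issues. Chunking the audio keeps each transcription job small and makes the process modular and robust.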

Save this script as segment_audio.py to chunk your audio files:

import subprocess
import os

def split_audio(input_file, segment_length_seconds=300, output_dir="chunks"):
    """Splits an audio file into smaller segments using FFmpeg."""
    os.makedirs(output_dir, exist_ok=True)
        
    # Get total duration
    cmd_duration = [
        "ffprobe", "-v", "error", "-show_entries", 
        "format=duration", "-of", "default=noprint_wrappers=1:nokey=1", 
        input_file
    ]
    duration = float(subprocess.check_output(cmd_duration).strip())
    
    print(f"Total duration: {duration} seconds ({duration / 60:.2f} minutes)")
    
    chunk_index = 0
    start_time = 0
    
    while start_time < duration:
        output_file = os.path.join(output_dir, f"chunk_{chunk_index:03d}.mp3")
        cmd_split = [
            "ffmpeg", "-y", "-ss", str(start_time), "-t", str(segment_length_seconds),
            "-i", input_file, "-acodec", "libmp3lame", "-b:a", "64k", output_file
        ]
        # check=True raises CalledProcessError if FFmpeg fails, instead of silently continuing
        subprocess.run(cmd_split, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
        print(f"Generated {output_file} starting at {start_time}s")
        
        start_time += segment_length_seconds
        chunk_index += 1

if __name__ == "__main__":
    split_audio("meeting_recording.aac", segment_length_seconds=300)
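
To see what the loop in split_audio will produce before running FFmpeg, the chunk boundaries can be computed on their own. This is an illustrative sketch that mirrors the loop's arithmetic; chunk_plan is a hypothetical helper, not part of the script above.

```python
import math


def chunk_plan(duration_seconds, segment_length_seconds=300):
    """Return (index, start, length) tuples matching the split loop."""
    count = math.ceil(duration_seconds / segment_length_seconds)
    plan = []
    for index in range(count):
        start = index * segment_length_seconds
        # The last chunk is shorter when the duration is not an exact multiple.
        length = min(segment_length_seconds, duration_seconds - start)
        plan.append((index, start, length))
    return plan


if __name__ == "__main__":
    for index, start, length in chunk_plan(3700):
        print(f"chunk_{index:03d}.mp3 starts at {start}s and lasts {length}s")
```

A recording of 3,700 seconds yields 13 chunks, the last one 100 seconds long. At the 64 kbit/s bitrate used above, a full 300-second chunk is roughly 2.4 MB.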

Phase 3: Transcribing Locally with Whisper

Once chunked, we can run Whisper over each segment. The script below loads the base model for speed (tiny is faster still), but you can upgrade to small, medium, or large-v3 for higher accuracy at the cost of more memory and time.

Save this as transcribe_chunks.py:

import os
import whisper

def transcribe_directory(chunk_dir="chunks", output_file="raw_transcript.txt"):
    # Load Whisper model (tiny, base, small, medium, or large)
    print("Loading Whisper model...")
    model = whisper.load_model("base") 
    
    chunks = sorted([f for f in os.listdir(chunk_dir) if f.endswith(".mp3")])
    full_transcript = []
    
    for chunk in chunks:
        chunk_path = os.path.join(chunk_dir, chunk)
        print(f"Transcribing {chunk_path}...")
        
        result = model.transcribe(chunk_path, language="id")  # "id" = Indonesian; use "en" for English
        full_transcript.append(result["text"])
        
    # Combine and save
    combined_text = "\n".join(full_transcript)
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(combined_text)
        
    print(f"Transcription complete! Saved to {output_file}")

if __name__ == "__main__":
    transcribe_directory()
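
Besides "text", the dictionary returned by model.transcribe also contains a "segments" list whose entries carry "start", "end", and "text", with times relative to the start of the chunk. If you want timestamps in the final minutes, one option is to shift them by the chunk's position. The sketch below assumes fixed 300-second chunks numbered from zero, as produced in Phase 2; both function names are hypothetical.

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def timestamped_lines(segments, chunk_index, segment_length_seconds=300):
    """Turn one chunk's Whisper segments into lines with absolute timestamps."""
    offset = chunk_index * segment_length_seconds
    lines = []
    for segment in segments:
        start = format_timestamp(offset + segment["start"])
        lines.append(f"[{start}] {segment['text'].strip()}")
    return lines


if __name__ == "__main__":
    # Hand-written sample shaped like Whisper's "segments" entries.
    sample = [{"start": 12.4, "end": 15.0, "text": " Hello team."}]
    print("\n".join(timestamped_lines(sample, chunk_index=2)))
```

Inside the transcription loop this would be called as timestamped_lines(result["segments"], index), where index is the chunk's position in the sorted list.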

Phase 4: Summarizing and Structuring

Now we have a raw, unstructured block of text. To convert it into official Minutes of Meeting, we pass it to an LLM (such as Gemini or Claude). If the recording is confidential, a locally hosted model keeps this step private as well. Use the following prompt format to instruct your AI model:

You are an executive assistant. Take the following raw meeting transcript and structure it into a formal Minutes of Meeting (MoM) in Indonesian/English. 

Your output MUST follow this template:
# Minutes of Meeting (MoM)
## [Meeting Topic Name]
- **Date**: [Date]
- **Attendees**: [List of attendees]

### Executive Summary
[A concise 3-4 sentence paragraph summarizing the key purpose, outcomes, and major decisions of the meeting]

### Discussion & Agenda Items
1. **[Topic 1]**
   - Point 1.
   - Point 2.
2. **[Topic 2]**
   - Point 1.

### Key Decisions Table
| No | Decision | Description / Notes |
|---|---|---|
| 1 | [Decision Name] | [Details] |

### Action Items Table
| No | Action Item | Assignee | Deadline |
|---|---|---|---|
| 1 | [Task Description] | [Person Name] | [Date] |
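
The prompt ends with the Action Items table above. To send it together with the transcript, the two pieces need to be joined into one string. The following is a minimal sketch that only builds that string and does not call any LLM API; the build_prompt name and the BEGIN/END markers are assumptions made for illustration.

```python
def build_prompt(instructions, transcript):
    """Join the MoM instructions and the raw transcript into one prompt string."""
    transcript = transcript.strip()
    if not transcript:
        raise ValueError("Transcript is empty; nothing to summarize.")
    # Explicit markers make it clear to the model where the transcript starts and ends.
    return (
        f"{instructions.strip()}\n\n"
        "--- BEGIN TRANSCRIPT ---\n"
        f"{transcript}\n"
        "--- END TRANSCRIPT ---"
    )


if __name__ == "__main__":
    demo = build_prompt("You are an executive assistant.", "Budget approved for Q3.")
    print(demo)
```

In the pipeline, instructions would hold the prompt text above and transcript the contents of raw_transcript.txt.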

Phase 5: Generating Word Documents (.docx) Programmatically

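Once the LLM outputs the structured Markdown, we can compile it into a styled Word document (.docx) using the python-docx library. This ensures layout consistency without opening Word manually. The script below hardcodes sample content to demonstrate the styling; in practice you would substitute the fields from the LLM output.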

Save this script as generate_word.py:

import docx
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import parse_xml
from docx.oxml.ns import nsdecls

def set_cell_background(cell, fill_hex):
    """Sets background color of a table cell."""
    tcPr = cell._tc.get_or_add_tcPr()
    shd = parse_xml(f'<w:shd {nsdecls("w")} w:fill="{fill_hex}"/>')
    tcPr.append(shd)

def generate_mom_document(output_path="MoM_Formatted.docx"):
    doc = docx.Document()
    
    # 1. Page Margin Settings
    sections = doc.sections
    for section in sections:
        section.top_margin = Inches(1.0)
        section.bottom_margin = Inches(1.0)
        section.left_margin = Inches(1.0)
        section.right_margin = Inches(1.0)
        
    # 2. Color Palette Setup
    COLOR_PRIMARY = RGBColor(31, 78, 121)     # Deep Blue
    COLOR_SECONDARY = RGBColor(127, 127, 127) # Cool Gray
    
    # 3. Document Title
    title = doc.add_paragraph()
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
    run = title.add_run("MINUTES OF MEETING")
    run.font.name = "Arial"
    run.font.size = Pt(20)
    run.font.bold = True
    run.font.color.rgb = COLOR_PRIMARY
    
    # Add metadata block
    p_meta = doc.add_paragraph()
    p_meta.add_run("Date: ").bold = True
    p_meta.add_run("June 26, 2026\n")
    p_meta.add_run("Attendees: ").bold = True
    p_meta.add_run("Example Team A, Example Team B\n")
    
    # 4. Heading Style Helper
    def add_custom_heading(text, level):
        h = doc.add_heading(level=level)
        h.paragraph_format.space_before = Pt(12)
        h.paragraph_format.space_after = Pt(6)
        r = h.add_run(text)
        r.font.name = "Arial"
        r.font.size = Pt(14) if level == 1 else Pt(12)
        r.font.bold = True
        r.font.color.rgb = COLOR_PRIMARY
        return h

    # Section 1: Executive Summary
    add_custom_heading("1. Executive Summary", level=1)
    p_sum = doc.add_paragraph("This session outlined the operational integration of the static QRIS gateway with the client platform, highlighting the manual validation process and mitigation of payment spoofing.")
    
    # Section 2: Key Decisions (Table Layout)
    add_custom_heading("2. Key Decisions", level=1)
    table = doc.add_table(rows=1, cols=3)
    table.style = 'Table Grid'
    
    # Set headers
    headers = ["No", "Key Decision", "Description"]
    hdr_cells = table.rows[0].cells
    for i, title_text in enumerate(headers):
        hdr_cells[i].text = title_text
        set_cell_background(hdr_cells[i], "1F4E79")
        for p in hdr_cells[i].paragraphs:
            for r in p.runs:
                r.font.color.rgb = RGBColor(255, 255, 255)
                r.font.bold = True
                
    # Add decision row
    row_cells = table.add_row().cells
    row_cells[0].text = "1"
    row_cells[1].text = "Static-to-Dynamic Conversion"
    row_cells[2].text = "Inject amount on client-side and recalculate CRC16 to avoid manual typing."
    
    # Save the file
    doc.save(output_path)
    print(f"Word Document compiled successfully at {output_path}!")

if __name__ == "__main__":
    generate_mom_document()
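
The script above fills the table with fixed sample values. One way to bridge it to the LLM output is to read the rows from the Markdown pipe tables that the Phase 4 template asks for. This is a rough sketch with a hypothetical parse_markdown_table helper; it assumes simple tables with no escaped pipe characters inside cells.

```python
def parse_markdown_table(markdown_text):
    """Return the first pipe table in the text as a list of rows of cell strings."""
    rows = []
    for line in markdown_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row (e.g. |---|---|---|).
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


if __name__ == "__main__":
    sample = """### Key Decisions Table
| No | Decision | Description / Notes |
|---|---|---|
| 1 | [Decision Name] | [Details] |
"""
    for row in parse_markdown_table(sample):
        print(row)
```

The first returned row holds the headers; each remaining row could be written into the document with table.add_row().cells, the same call generate_mom_document already uses for its sample decision.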

Conclusion

By running a local Whisper stack coupled with programmatic template engines, operations teams can quickly digest hours of meeting audio into formatted reports. This keeps proprietary or sensitive financial data entirely within local bounds while optimizing operational throughput.

Thanks for reading. See you in the next lab.

Farros FR
