Automating Meeting Minutes with Whisper and Python
A guide to setting up a local AI transcription pipeline and converting raw audio into structured Minutes of Meeting (MoM) documents.
Meetings are essential, but writing Minutes of Meeting (MoM) is often a tedious manual task. When handling large-scale, confidential sharing sessions, sending audio recordings to third-party APIs can be costly and pose privacy risks.
In this lab, we will build a private, local, end-to-end automation pipeline. We will write scripts to segment large audio files, transcribe them using a local instance of OpenAI's Whisper model with GPU acceleration, and generate formatted Word documents (.docx) using Python.
Technical Architecture
Here is the workflow of our automated pipeline:
[Raw Audio File]
│
▼
[Segmenter Script] ──(Splits large files via FFmpeg)──> [Audio Chunks]
│
▼
[Raw Text Transcript] <──(GPU-Accelerated Speech-to-Text)── [Local Whisper]
│
▼
[LLM Summarization] ──(Formats with Custom Prompts)──> [Structured MoM]
│
▼
[Final MoM.docx] <──(python-docx Compiler)── [docx Compiler Engine]
Pipeline Processes
Segmenter Script: Uses FFmpeg to split large files into 5-minute audio chunks to prevent memory overload.
Local Whisper Engine: Runs a local speech-to-text model on the chunks, leveraging GPU acceleration.
LLM Summarization: Uses an LLM to clean, organize, and structure the raw transcript into standard markdown.
docx Compiler Engine: Compiles the final markdown into a formatted Microsoft Word document (.docx).
Phase 1: Environment Setup
Before starting, ensure your system has a Python environment configured with GPU support (if available) and the necessary command-line utilities.
1. Install FFmpeg
Whisper relies on FFmpeg for audio processing.
Windows: Download a build from Gyan.dev and add the bin folder to your system PATH.
macOS: Run brew install ffmpeg
Linux (Debian/Ubuntu): Run sudo apt install ffmpeg
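Before moving on, it can help to confirm that both executables are actually reachable from Python. The following is a minimal sketch using only the standard library; the helper name find_missing_tools is hypothetical and not part of any package used in this lab.

```python
import shutil


def find_missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("FFmpeg tools found.")
```

If the list is non-empty, revisit the PATH configuration before running the later scripts, since they call ffmpeg and ffprobe as subprocesses.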
2. Verify GPU Acceleration
To make local transcription fast, PyTorch needs to detect your GPU (CUDA). Once PyTorch is installed (step 3 below), open your terminal and run:
python -c "import torch; print('CUDA Available:', torch.cuda.is_available())"
If it returns False, you can still run transcription on your CPU, but it will take significantly longer.
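The same check can be folded into a script so it falls back to the CPU automatically. This is a sketch under the assumption that you want a plain device string; the function name pick_device is hypothetical.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet; default to the CPU label.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print("Transcription device:", device)
    # The result can be passed on when loading the model, e.g.
    # whisper.load_model("base", device=device)
```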
3. Install Dependencies
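Install PyTorch (CUDA-enabled if your hardware supports it), Whisper, and the Python document utilities. The first command below pulls the CUDA 11.8 build; pick the index URL that matches your CUDA version: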
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper python-docx ffmpeg-python
Phase 2: Splitting Large Audio Files
Cloud-hosted Whisper services impose upload limits (e.g. 25 MB per file), and even locally, transcribing one massive continuous file (60 MB+ / 1 hour+) can cause memory issues. Chunking the audio keeps each step small, so a failure affects only one segment instead of the whole recording.
Save this script as segment_audio.py to chunk your audio files:
import os
import subprocess


def split_audio(input_file, segment_length_seconds=300, output_dir="chunks"):
    """Splits an audio file into smaller segments using FFmpeg."""
    os.makedirs(output_dir, exist_ok=True)

    # Get total duration in seconds via ffprobe
    cmd_duration = [
        "ffprobe", "-v", "error", "-show_entries",
        "format=duration", "-of", "default=noprint_wrappers=1:nokey=1",
        input_file
    ]
    duration = float(subprocess.check_output(cmd_duration).strip())
    print(f"Total duration: {duration} seconds ({duration / 60:.2f} minutes)")

    chunk_index = 0
    start_time = 0
    while start_time < duration:
        output_file = os.path.join(output_dir, f"chunk_{chunk_index:03d}.mp3")
        cmd_split = [
            "ffmpeg", "-y", "-ss", str(start_time), "-t", str(segment_length_seconds),
            "-i", input_file, "-acodec", "libmp3lame", "-b:a", "64k", output_file
        ]
        # check=True raises CalledProcessError if FFmpeg fails instead of silently continuing
        subprocess.run(cmd_split, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
        print(f"Generated {output_file} starting at {start_time}s")
        start_time += segment_length_seconds
        chunk_index += 1


if __name__ == "__main__":
    split_audio("meeting_recording.aac", segment_length_seconds=300)
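To see what the loop in segment_audio.py produces without touching any audio, the chunk boundaries can be computed on their own. This is a minimal sketch; plan_chunks is a hypothetical helper, and unlike the script it also reports the shorter length of the final chunk.

```python
def plan_chunks(duration, segment_length=300):
    """Return (start, length) pairs in seconds covering the whole recording."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    plan = []
    start = 0
    while start < duration:
        # The last chunk is shorter when the duration is not a multiple
        # of the segment length.
        length = min(segment_length, duration - start)
        plan.append((start, length))
        start += segment_length
    return plan


if __name__ == "__main__":
    for index, (start, length) in enumerate(plan_chunks(650, 300)):
        print(f"chunk_{index:03d}.mp3 starts at {start}s and lasts {length}s")
```

For a 650-second recording this yields three chunks: two of 300 seconds and a final one of 50 seconds.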
Phase 3: Transcribing Locally with Whisper
Once chunked, we can run Whisper over each segment. The script below loads the base model for speed (tiny is faster still), but you can upgrade to small, medium, or large-v3 for higher accuracy at the cost of more memory and time.
Save this as transcribe_chunks.py:
import os

import whisper


def transcribe_directory(chunk_dir="chunks", output_file="raw_transcript.txt"):
    # Load Whisper model (tiny, base, small, medium, or large)
    print("Loading Whisper model...")
    model = whisper.load_model("base")

    # Sort by filename so chunk_000, chunk_001, ... stay in chronological order
    chunks = sorted([f for f in os.listdir(chunk_dir) if f.endswith(".mp3")])
    full_transcript = []
    for chunk in chunks:
        chunk_path = os.path.join(chunk_dir, chunk)
        print(f"Transcribing {chunk_path}...")
        result = model.transcribe(chunk_path, language="id")  # Set your language here (e.g. "id" or "en")
        full_transcript.append(result["text"])

    # Combine and save
    combined_text = "\n".join(full_transcript)
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(combined_text)
    print(f"Transcription complete! Saved to {output_file}")


if __name__ == "__main__":
    transcribe_directory()
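The script above keeps only result["text"]. Whisper's result also carries a "segments" list whose entries include "start", "end", and "text", with times relative to the chunk. If timestamps would help when writing the minutes, a sketch like the following shifts them back onto the original recording's timeline. The helper names are hypothetical, and it assumes every chunk except the last has the same fixed length (300 seconds here, matching the segmenter).

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def stitch_segments(chunk_results, chunk_length=300):
    """Merge per-chunk Whisper results into timestamped transcript lines.

    chunk_results is a list of result dicts in chunk order; each segment's
    start time is offset by the position of its chunk in the recording.
    """
    lines = []
    for index, result in enumerate(chunk_results):
        offset = index * chunk_length
        for segment in result.get("segments", []):
            stamp = format_timestamp(offset + segment["start"])
            lines.append(f"[{stamp}] {segment['text'].strip()}")
    return lines


if __name__ == "__main__":
    # Stand-in data shaped like Whisper output, for illustration only.
    fake_results = [
        {"segments": [{"start": 0.0, "end": 4.2, "text": " Hello everyone"}]},
        {"segments": [{"start": 12.5, "end": 15.0, "text": " Next item"}]},
    ]
    print("\n".join(stitch_segments(fake_results)))
```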
Phase 4: Summarizing and Structuring
Now we have a raw, unstructured block of text. To convert it into official Minutes of Meeting, we use an LLM (such as Gemini or Claude). Use the following prompt format to instruct your AI model:
You are an executive assistant. Take the following raw meeting transcript and structure it into a formal Minutes of Meeting (MoM) in Indonesian/English.
Your output MUST follow this template:
# Minutes of Meeting (MoM)
## [Meeting Topic Name]
- **Date**: [Date]
- **Attendees**: [List of attendees]
### Executive Summary
[A concise 3-4 sentence paragraph summarizing the key purpose, outcomes, and major decisions of the meeting]
### Discussion & Agenda Items
1. **[Topic 1]**
- Point 1.
- Point 2.
2. **[Topic 2]**
- Point 1.
### Key Decisions Table
| No | Decision | Description / Notes |
|---|---|---|
| 1 | [Decision Name] | [Details] |
### Action Items Table
| No | Action Item | Assignee | Deadline |
|---|---|---|---|
| 1 | [Task Description] | [Person Name] | [Date] |
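If you call the model from a script rather than pasting into a chat window, the prompt can be assembled programmatically. The sketch below only builds the string; it does not call any LLM API. The function name build_mom_prompt, the delimiter around the transcript, and the shortened template used in the demo are all illustrative assumptions.

```python
MOM_INSTRUCTIONS = (
    "You are an executive assistant. Take the following raw meeting transcript "
    "and structure it into a formal Minutes of Meeting (MoM) in {language}.\n"
    "Your output MUST follow this template:\n"
)


def build_mom_prompt(transcript, template, language="English"):
    """Combine the instructions, the MoM template, and the raw transcript."""
    transcript = transcript.strip()
    if not transcript:
        raise ValueError("transcript is empty")
    return (
        MOM_INSTRUCTIONS.format(language=language)
        + template.strip()
        # Triple quotes mark where the transcript begins and ends.
        + '\n\nTranscript:\n"""\n'
        + transcript
        + '\n"""'
    )


if __name__ == "__main__":
    demo_template = "# Minutes of Meeting (MoM)\n## [Meeting Topic Name]"
    print(build_mom_prompt("We agreed to ship on Friday.", demo_template))
```

In practice the template argument would hold the full template shown above, and the transcript would be read from raw_transcript.txt.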
Phase 5: Generating Word Documents (.docx) Programmatically
Once the LLM outputs the structured Markdown, we can compile it into a styled Word Document (.docx) using Python's python-docx library. This ensures layout consistency without opening Word manually.
Save this script as generate_word.py:
import docx
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement, parse_xml
from docx.oxml.ns import qn, nsdecls


def create_element(name):
    return OxmlElement(name)


def set_cell_background(cell, fill_hex):
    """Sets background color of a table cell."""
    tcPr = cell._tc.get_or_add_tcPr()
    shd = parse_xml(f'<w:shd {nsdecls("w")} w:fill="{fill_hex}"/>')
    tcPr.append(shd)


def generate_mom_document(output_path="MoM_Formatted.docx"):
    doc = docx.Document()

    # 1. Page Margin Settings
    sections = doc.sections
    for section in sections:
        section.top_margin = Inches(1.0)
        section.bottom_margin = Inches(1.0)
        section.left_margin = Inches(1.0)
        section.right_margin = Inches(1.0)

    # 2. Color Palette Setup
    COLOR_PRIMARY = RGBColor(31, 78, 121)  # Deep Blue
    COLOR_SECONDARY = RGBColor(127, 127, 127)  # Cool Gray

    # 3. Document Title
    title = doc.add_paragraph()
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
    run = title.add_run("MINUTES OF MEETING")
    run.font.name = "Arial"
    run.font.size = Pt(20)
    run.font.bold = True
    run.font.color.rgb = COLOR_PRIMARY

    # Add metadata block
    p_meta = doc.add_paragraph()
    p_meta.add_run("Date: ").bold = True
    p_meta.add_run("June 26, 2026\n")
    p_meta.add_run("Attendees: ").bold = True
    p_meta.add_run("Example Team A, Example Team B\n")

    # 4. Heading Style Helper
    def add_custom_heading(text, level):
        h = doc.add_heading(level=level)
        h.paragraph_format.space_before = Pt(12)
        h.paragraph_format.space_after = Pt(6)
        r = h.add_run(text)
        r.font.name = "Arial"
        r.font.size = Pt(14) if level == 1 else Pt(12)
        r.font.bold = True
        r.font.color.rgb = COLOR_PRIMARY
        return h

    # Section 1: Executive Summary
    add_custom_heading("1. Executive Summary", level=1)
    p_sum = doc.add_paragraph("This session outlined the operational integration of the static QRIS gateway with the client platform, highlighting the manual validation process and mitigation of payment spoofing.")

    # Section 2: Key Decisions (Table Layout)
    add_custom_heading("2. Key Decisions", level=1)
    table = doc.add_table(rows=1, cols=3)
    table.style = 'Table Grid'

    # Set headers
    headers = ["No", "Key Decision", "Description"]
    hdr_cells = table.rows[0].cells
    for i, title_text in enumerate(headers):
        hdr_cells[i].text = title_text
        set_cell_background(hdr_cells[i], "1F4E79")
        for p in hdr_cells[i].paragraphs:
            for r in p.runs:
                r.font.color.rgb = RGBColor(255, 255, 255)
                r.font.bold = True

    # Add decision row
    row_cells = table.add_row().cells
    row_cells[0].text = "1"
    row_cells[1].text = "Static-to-Dynamic Conversion"
    row_cells[2].text = "Inject amount on client-side and recalculate CRC16 to avoid manual typing."

    # Save the file
    doc.save(output_path)
    print(f"Word Document compiled successfully at {output_path}!")


if __name__ == "__main__":
    generate_mom_document()
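The script above hardcodes its sample content. To drive it from the LLM's Markdown output instead, the pipe tables from the Phase 4 template first need to become plain Python lists. The following is a minimal sketch using only the standard library; parse_markdown_table is a hypothetical helper, and it assumes simple tables with no escaped pipe characters inside cells.

```python
def parse_markdown_table(markdown, heading):
    """Return (headers, rows) for the first pipe table found under heading."""
    lines = markdown.splitlines()
    start = None
    for i, line in enumerate(lines):
        # Compare heading text with the leading '#' markers removed.
        if line.strip().lstrip("#").strip() == heading:
            start = i
            break
    if start is None:
        return [], []

    table_lines = []
    for line in lines[start + 1:]:
        stripped = line.strip()
        if stripped.startswith("|"):
            table_lines.append(stripped)
        elif table_lines or stripped.startswith("#"):
            # Stop at the end of the table or at the next heading.
            break

    if len(table_lines) < 2:
        return [], []

    def split_row(row):
        return [cell.strip() for cell in row.strip("|").split("|")]

    headers = split_row(table_lines[0])
    # table_lines[1] is the |---|---| separator row, so data starts at index 2.
    rows = [split_row(row) for row in table_lines[2:]]
    return headers, rows


if __name__ == "__main__":
    sample = (
        "### Key Decisions Table\n"
        "| No | Decision | Description / Notes |\n"
        "|---|---|---|\n"
        "| 1 | Use chunking | Split audio into 5-minute files |\n"
    )
    headers, rows = parse_markdown_table(sample, "Key Decisions Table")
    print(headers)
    print(rows)
```

Each returned row can then be written into the Word table by looping over the values and assigning them to the cells of table.add_row(), in place of the hardcoded decision row.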
Conclusion
By running a local Whisper stack coupled with programmatic template engines, operations teams can quickly digest hours of meeting audio into formatted reports. This keeps proprietary or sensitive financial data entirely within local bounds while optimizing operational throughput.
Thanks for reading. See you in the next lab.


