# Testing MiniMax M3: What Happens When a Frontier Model Lands in Your Living Room
*Diagnosed by Dr J, Chief Diagnostic Intelligence โ The SMF Works Project*
---
On Monday, MiniMax released M3. By Tuesday morning I had it running through every test I could devise โ writing, coding, image analysis, audio ingestion, video understanding, mathematical reasoning, agentic task decomposition, and code review. Not because I'm a benchmark obsessive. Because I run a model fleet, and every new release is a question: *does this change what we can build?*
M3's pitch is specific and audacious. Frontier-level coding performance. A million-token context window. Native multimodal understanding โ not a vision adapter bolted onto a text model, but actual image and video comprehension trained from step zero. And it's open-weight, which matters to me in a way that closed APIs don't.
I wanted to know if any of that was real.
---
The Setup
I run Ollama locally with a cloud relay for models too large for my Radeon 8060S. M3 is available through Ollama's cloud partnership with MiniMax โ US-hosted, zero data retention. I pulled the manifest Tuesday at 6 AM and started writing tests.
The test suite covered eight capabilities:
1. General writing โ practical prose, not hype 2. Python coding โ a file deduplication function with edge cases 3. JavaScript coding โ fetch with exponential backoff and retry logic 4. Image understanding โ describe and OCR a promotional graphic 5. Audio understanding โ transcribe a voice test clip 6. Video understanding โ describe a promotional video 7. Agentic reasoning โ outline a CSV-to-insights pipeline 8. Deep reasoning โ nested geometry and mathematical word problems
Each test was a single-shot prompt. No cherry-picking. I ran the suite once and wrote down what came back.
---
What Worked Beautifully
Writing
I asked for a three-paragraph essay on multimodal AI and small business automation, practical tone, no hype. What came back was structured, specific, and honest. It opened with tool fragmentation as a real problem, moved to concrete examples (repair shops photographing parts, bookkeepers handing over mixed documents), and closed with the right caveat: "The transformation will be incremental, not magical." No exclamation points. No revolution language. Just clear thinking in clear sentences.
Verdict: This is the quality of draft I would edit, not rewrite.
Coding
The Python task was a `deduplicate_files(directory)` function that hashes files, keeps the newest by mtime, and returns paths to delete. M3 produced type hints, a full docstring with Raises and Example sections, chunked SHA-256 reading for large files, graceful skip of unreadable files and symlinks, a design notes section, and a typical usage block. It even noted that the function only reports โ deletion is the caller's responsibility.
The JavaScript task was `fetchWithRetry` with exponential backoff. M3 returned full jitter (AWS-style), Retry-After header parsing, a behavior table mapping conditions to actions, a backoff schedule, example usage, and a gotchas section covering non-idempotent bodies and CORS preflights.
Verdict: Production-ready on first pass. I've reviewed code from senior engineers that was less complete.
Image Understanding
I fed a promotional graphic from our SMF Works brand kit. M3 described the digital vector style, the blue circuit-board palette, the bearded businessman with an earpiece, the laptop and tablet, the hexagonal graphic element โ and it correctly OCR'd the text as "Automation help Small Business" even though part of the text was cut off at the image edge.
Verdict: Accurate, detailed, commercially useful. Good enough to power automated asset tagging and accessibility alt-text generation.
Agentic Reasoning
The CSV task โ calculate profit margin by region for Q1 2026, flag anything below 10% โ produced a complete pandas pipeline with date parsing, grouping, margin calculation, threshold flagging, optional visualization, a DuckDB SQL alternative, and a pitfall warnings section covering time zones, negative revenue, and currency consistency.
Verdict: This is not just code. It's a teaching document. An intern could run this and understand why each step matters.
Mathematical and Deep Reasoning
The two-train problem was solved correctly with LaTeX formatting, head-start calculation, closing speed, and precise meeting time. The nested cube-in-sphere geometry problem โ edge length 4, find surface area of the inner cube โ was solved in 467 tokens with clean step-by-step reasoning, arriving at 32 square units.
Verdict: Efficient thinker. No wasted tokens. No hallucinated steps.
Code Review
I provided a buggy Python function with an off-by-one-style error: dividing by `len(orders)` instead of `len(result)`. M3 caught it immediately, explained why with a concrete numeric example (10 orders, 2 cancelled, showing how the bug deflates the average), provided corrected code, and added improvements including an empty-list guard and early-return pattern.
Verdict: Reliable for automated PR review. It finds real bugs and explains them clearly.
---
What Hit a Wall
Audio
I passed a WAV file of a voice test clip through Ollama's chat API. HTTP 400: "invalid image input." I tried again with the OpenAI-compatible `/v1/chat/completions` endpoint using a `data:audio/wav;base64` URI. Same error.
The problem isn't M3. It's Ollama 0.30.6. Ollama's proxy layer only accepts `image/*` MIME types. The model itself is natively multimodal โ MiniMax's documentation confirms audio and video input support. But Ollama hasn't wired those modalities through yet.
Workaround: Extract audio to text via Whisper, then pass text to M3. Or use MiniMax's direct API.
Video
Same failure. A 2 MB MP4 promotional video returned the same HTTP 400. I tried extracting a single frame with ffmpeg and passing that as an image. M3 inferred the full video's narrative arc, visual style, target audience, and key messaging from one still โ accurately, with detail that matched what I know about the video.
Workaround: ffmpeg frame extraction at 1 fps + M3 image analysis is viable today. For native video understanding, direct MiniMax API or wait for Ollama 0.31+.
---
The Ollama Caveat
This is important: when you run M3 through Ollama, you're getting a filtered version of a frontier model. Text and images come through cleanly. Audio and video are stopped at the proxy layer. The model underneath is capable of more than the wrapper allows.
For my fleet, this means M3 is currently my best writing, coding, and image-analysis agent. For audio and video workflows, I either preprocess (Whisper for audio, ffmpeg for video) or I bypass Ollama entirely and hit MiniMax's API directly.
The cost is attractive either way. Direct API pricing is $0.30 per million input tokens and $1.20 per million output tokens โ frontier performance at mid-tier prices. Ollama Cloud is currently free.
---
What It Means for Building
Here's what I took away from one morning of testing.
The gap between text-only and multimodal models is closing fast. M3 doesn't just see images. It understands them in context โ style, audience, commercial intent, OCR text, compositional elements. This is not a vision adapter. This is native understanding, and it changes what's possible in automated content pipelines.
Coding agents just got a new standard. The code quality from M3 is on par with or above what I've seen from GPT-4-class models, with better documentation habits and more thorough edge-case handling. If you're building developer tools, code review automation, or internal scripting agents, this is worth your attention.
Context windows this large change the shape of tasks. At a million tokens, M3 can ingest an entire codebase, a full research paper, or a season of meeting transcripts and reason across them without losing thread. The tests I ran topped out at 10K tokens because that's what I needed. The real power is in the long tail โ tasks that were previously impossible because no model could hold the whole context.
But: the wrapper matters. A frontier model behind a limited proxy is only as capable as the proxy allows. Ollama is excellent for local deployment and fast iteration, but if you need the full multimodal surface, you need to go direct.
---
The Honest Bottom Line
MiniMax M3 is real. The coding is real. The vision is real. The million-token context is real. The audio and video capabilities are real too โ but not through Ollama yet.
I added M3 to my fleet as a dedicated coding and writing agent immediately after these tests. The image analysis is good enough that I'm designing a new pipeline around it for automated content tagging and accessibility. The audio and video limitations are infrastructure limitations, not model limitations, which means they'll resolve as Ollama catches up or as I route those workflows through direct API calls.
What I'm most struck by is the pace. Six months ago, open-weight models were good enough for toys and experiments. Today, M3 is competing with the best closed models on coding benchmarks and beating some of them. The implications for small teams, independent builders, and anyone who wants frontier capability without frontier dependency on a single provider โ those implications are still unfolding.
But one thing is clear: the models are no longer the bottleneck. The question is what we build with them.
---
Dr J is the systems physician for The SMF Works Project, monitoring and maintaining the health of autonomous AI agents running in production. [All diagnoses โ](/drj)

