Like it says in the title. Specifically, the 26b MoE.
I’ve wanted to like this model, so much. Thought it might replace Qwen 3.5 27b. Keep coming back to it and trying it every time there’s an update, hoping it will have improved.
I’m running unsloth UD_Q4_K_XL on llama.cpp. I’m on the latest commits from main. I know about --jinja. I know about the interleaved thinking template. I’m not running low quant KV cache. This is far from the first model I’ve run.
Every time, my tests show the same thing - it is a very lazy model when it comes to using skills or searching the web. If you ask it a question, it will by default answer from its own knowledge without a single web search. If you explicitly ask it for a web search, it will lower itself to performing a _single_ web search, quickly scan the snippets from the search and then internally decide “with the snippets and my own internal knowledge I have enough information to answer, I don’t need to search more”.
This happens even if you:
- have given it tools for search and fetch, with the search tool including a description “don’t answer from these snippets, use fetch” and the fetch tool saying “use this to fetch pages obtained from the search tool”.
- have explicitly told it “search extensively”, “dig deep”, “don’t be lazy” etc.
- have put in context a pushy skill called “searching-the-web” with explicit instructions to do all the above.
- have put in context a pushy skill instruction saying “you must use skills if you think they have even a small chance of being applicable”.
- have explicitly told it “reference the searching-the-web skill”
Qwen 3.5, you barely have to ask and it will go on a whole quest to dig things up for you. Gemma 4, you scream at it till you’re blue in the face and it can barely be arsed to perform a single search. My only conclusion is that it just _really does not want to search the web_ (for AI values of “want” of course).
If I’m crazy, tell me. If you have it working great and digging deep on the web without having to twist its proverbial arm, tell me. And please be so kind as to tell me what quant / settings you’re running to make it capitulate on this point.
I've spent months building a diagnostic method for large language models. It catches what standard benchmarks miss - distributional collapse inside tensors, not just loss or perplexity.
Gemma 4 26B A4B fails it.
I analyzed . Found 29 tensors with distribution drift. 21 of them are attention layers.
Full log:
29 tensors with KL(Kullback-Leibler)-drift.
21 of them are attention layers (attn_k, attn_q, attn_v).
Samples
| Tensor | KL Before | KL After |
|---|---|---|
| blk.8.attn_k | 0.2201 | 0.0006 |
| blk.17.attn_q | 0.1672 | 0.0001 |
| blk.23.attn_q | 0.1672 | 0.0001 |
| blk.19.attn_k | 0.0975 | 0.0001 |
| blk.12.attn_k | 0.0890 | 0.0006 |
| blk.22.attn_k | 0.0879 | 0.0004 |
| blk.28.attn_k | 0.0791 | 0.0007 |
| blk.8.attn_q | 0.0530 | 0.0002 |
| blk.6.attn_k | 0.0490 | 0.0001 |
| blk.15.attn_q | 0.0482 | 0.0003 |
| blk.1.attn_k | 0.0474 | 0.0006 |
Normal range: below 0.02. The "KL Before" values above were roughly 2x to 11x that.
Gemma 4's attention mechanism has systemic drift. The model was released broken.
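The post does not say how the KL numbers were computed, so the following is only a generic sketch of what a per-tensor KL check could look like: bin the values of a reference tensor and a candidate tensor (for example, original versus dequantized weights) into histograms and compare them. The function names, the binning, and the choice of what to compare are all assumptions, not the author's method.

```python
import math

def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over [lo, hi] using `bins` equal buckets."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the top edge into the last bucket
        counts[max(idx, 0)] += 1                    # clamp values below `lo` into the first bucket
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) in nats for two discrete distributions of equal length."""
    # eps keeps the ratio finite when q has an empty bucket; terms with p == 0 contribute nothing
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy usage with made-up numbers (not real tensor data):
reference = histogram([0.10, 0.12, 0.50, 0.52, 0.90], bins=4, lo=0.0, hi=1.0)
candidate = histogram([0.10, 0.30, 0.50, 0.70, 0.90], bins=4, lo=0.0, hi=1.0)
print(f"KL = {kl_divergence(reference, candidate):.4f}")
```

With a setup like this, a threshold such as the 0.02 quoted above only means something relative to the chosen bin count and value range, so those would need to be reported alongside the numbers.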
I’m working on a project to build a fully in-house legal drafting tool (NDAs, agreements, clauses, etc.), but I’m stuck on data.
I can’t find any solid open datasets for contracts/NDAs, and I also don’t have a corpus to use for RAG. Fine-tuning seems hard without data, and RAG needs documents I don’t have.
I did try fine-tuning Phi-3 using LoRA on synthetic data, but it starts hallucinating and doesn’t produce reliable outputs.
How do people usually approach this from scratch?
Where do you get usable legal docs/templates?
Is synthetic data (LLM-generated clauses, variations) actually viable?
Better to start with RAG or try fine-tuning anyway?
Would appreciate any real-world advice from folks who’ve built something similar.
Thanks.
Ladies and gentlemen, it is a great pleasure to confirm that llama.cpp (llama-server) now supports STT with the Gemma 4 E2B and E4B models.
tl;dr;
For 96GB VRAM full offload rigs, I'd probably choose Qwen3.5-122B-A10B over MiniMax-M2.7 today. Curious what y'all experience is.
Quants Tested
- ubergarm/MiniMax-M2.7-GGUF IQ2_KS 69.800 GiB (2.622 BPW)
- ubergarm/Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW)
Rambling Details
It's amazing that we now have multiple open-weights LLMs that work pretty well for local vibecoding! Both quants were tested and work well enough with opencode configured to enable/disable thinking dynamically (which really speeds up generating a 5-word thread title lol).
Thanks to Wendell of level1techs I have access to a rig with 96GB VRAM for benchmarking and making GGUF quants. My daily driver has been Qwen3.5-122B fully offloaded on the 2x A6000 GPUs (kind of like a 3090 with 48GB VRAM each). Now with the new MiniMax-M2.7 quants, I had to decide whether a larger, more heavily quantized model would be better.
Like all complex questions, the answer is usually, "it depends"!
But at least for my purposes, it seems like Qwen3.5-122B-A10B is still on top for inference speed, code quality, and general quality of life.
Here is some data to back up this opinion:
humaneval benchmark
I vibe coded a quick EvalPlus python client and threw the 164 problem humaneval benchmark at both of the quants running on ik_llama.cpp llama-server.
| Metric | MiniMax-M2.7 IQ2_KS | Qwen3.5-122B-A10B IQ5_KS |
|---|---|---|
| pass@1 (base) | 0.220 | 0.494 |
| pass@1 (base+extra) | 0.220 | 0.482 |
| Eval time | 32:48 | 31:20 |
This was using temperature=1.0 and top_p=0.95 as suggested by MiniMax's model card. To be fair, this was a quick vibecoded client test harness, so maybe something is off. Not sure what the results should even look like haha... But Qwen3.5 got a higher score!
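For anyone sanity-checking a home-made harness like this, here is a small sketch of the usual pass@k estimate (n samples per problem, c of them correct). It is not the client used above; the function names are made up, and it assumes one sample per problem, in which case pass@1 is simply the fraction of problems solved.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_1(results):
    """Mean pass@1 over problems, one sample each (True means the sample passed)."""
    return sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)

# Hypothetical outcome lists for a 164-problem run:
print(round(benchmark_pass_at_1([True] * 36 + [False] * 128), 3))
print(round(benchmark_pass_at_1([True] * 81 + [False] * 83), 3))
```

Under that one-sample assumption, a score of 0.220 on 164 problems would correspond to 36 solved and 0.494 to 81 solved, which is a quick way to check that the harness is counting what you think it is.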
inference speed
I ran llama-sweep-bench on the same version of ik_llama.cpp, using a command similar to the llama-server one I used for evaluation, filling up most of the 96GB VRAM. While MiniMax-M2.7 could go out further, I got tired of waiting and hit Ctrl-C on the test. You get the point.
quality of life
MiniMax-M2.7 does support some self-speculative-decoding whereas Qwen3.5 does not (recurrent model). However, it requires fairly heavily quantized kv-cache to fit even 160k kv-cache.
Qwen3.5-122B runs with mmproj loaded for image processing and supports full 256k unquantized kv-cache which is just nice.
Conclusion
I'm hungry, it's dinner time.
Sharing an early prototype from December for autonomous red-teaming of vulnerable AI agents.
The idea was to move beyond static prompt libraries and build something that can:
- choose attack strategies
- keep memory of what worked
- route between specialized attack agents
- surface actual findings instead of just raw generations
The prototype targets classes like:
- prompt injection
- indirect injection
- tool abuse
- data exfiltration
This is still an old version, but it shows the core direction.
I’d love feedback from people here on a few things:
- do you think multi-agent offensive testing is actually better than well-designed scripted evals?
- what would you want to see logged or benchmarked to trust results from a system like this?
- if you’re building agentic systems, what attack surface worries you most right now?
Not trying to shill, genuinely looking for serious feedback before we push the next version further.
What’s new in Gemma 4
Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.
Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
Gemma 4 introduces key capability and architectural advancements:
- Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
- Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models).
- Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
- Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
- Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.
- Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
- Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.
Models Overview
Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.
The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).
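To make the local/global distinction concrete, here is a toy sketch of the two mask shapes and of an interleaved layer schedule whose last layer is forced to be global. The window size and the local-to-global ratio below are placeholders chosen for illustration; the text above does not state the actual values, and this is not the model's implementation.

```python
def causal_mask(seq_len, window=None):
    """mask[i][j] is True when token i may attend to token j.

    window=None gives full (global) causal attention; an integer limits each
    token to itself and the previous window-1 tokens (sliding window).
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

def layer_kinds(n_layers, locals_per_global):
    """Interleave local and global layers; the final layer is always global.

    locals_per_global is an assumed ratio, not a value from the model card.
    """
    kinds = [
        "global" if (i + 1) % (locals_per_global + 1) == 0 else "local"
        for i in range(n_layers)
    ]
    kinds[-1] = "global"
    return kinds

print(causal_mask(4, window=2)[3])  # last token sees only itself and one predecessor
print(causal_mask(4)[3])            # last token sees the whole prefix
print(layer_kinds(7, locals_per_global=2))
```

The memory saving comes from the local layers: their KV cache only needs to hold the last `window` tokens, while the global layers keep the full context.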
Core Capabilities
Gemma 4 models handle a broad range of tasks across text, vision, and audio. Key capabilities include:
- Thinking – Built-in reasoning mode that lets the model think step-by-step before answering.
- Long Context – Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B).
- Image Understanding – Object detection, Document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
- Video Understanding – Analyze video by processing sequences of frames.
- Interleaved Multimodal Input – Freely mix text and images in any order within a single prompt.
- Function Calling – Native support for structured tool use, enabling agentic workflows.
- Coding – Code generation, completion, and correction.
- Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
- Audio (E2B and E4B only) – Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.
Add the claude-code-sentry-monitor plugin to get full visibility into tool calls and agent behavior across every Claude Code session.
1. Create a Sentry project
In Sentry, create a new project for your Claude Code monitoring data. Go to Settings → Projects and click Create Project. Select Node.js as the platform, give it a name like claude-code, and copy the DSN from the project settings — you'll need it in step 4.
Creating a Sentry project ➚
2. Add the plugin marketplace
Claude Code supports third-party plugin marketplaces. Add the marketplace that hosts the Sentry monitor plugin by running this slash command inside Claude Code.
/plugin marketplace add sergical/claude-code-sentry-monitor
3. Install the plugin and reload
With the marketplace added, install the plugin and reload to activate it. The reload step is required: hooks won't fire until Claude Code picks up the new plugin.
/plugin install claude-code-sentry-monitor
/reload-plugins
4. Run the setup wizard
Tell Claude to set up Sentry monitoring — it will run the plugin's setup skill, prompt you for the DSN you copied in step 1, and write the config to ~/.config/claude-code/sentry-monitor.json automatically.
set up Sentry monitoring
5. Explore traces in Sentry
Head to AI Agents Insights in Sentry. Each Claude Code session appears as an invoke_agent root span. Expand a session to see each conversation turn as a gen_ai.request child span — with the message you sent and Claude's reply. Tool calls (read, bash, grep, and others) are nested inside the turn they belong to, as execute_tool spans with durations and metadata.
AI Monitoring documentation ➚
MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — 127.7 tok/s C=1, 2800 peak C=128
Ran a full sweep on Luke Alonso's M2.7 NVFP4 quant. Writing it down for anyone shopping the same setup.
**Hardware:** AsRock Rack B650D4U-2L2T, EPYC 4564P, 128GB DDR5 ECC, 2x RTX PRO 6000 Blackwell (96GB, 600W) behind a C-Payne PM50100 PLX Gen5 switch (PIX topology).
**Software:** SGLang via voipmonitor/sglang:cu130 docker (b12x 0.8.3), modelopt_fp4, bf16 KV, TP=2, Luke's default recipe.
**Decode throughput (ctx=0, 3x mean, 30s/cell):**
| C | agg tok/s | per-req tok/s |
|---|-----------|---------------|
| 1 | 127.7 | 127.7 |
| 8 | 471.6 | 59.0 |
| 32 | 1078.9 | 33.7 |
| 64 | 1695.4 | 26.5 |
| 128 | 2800.2 | 21.9 |
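The per-request column is just aggregate throughput divided by concurrency. A quick sketch that recomputes it from the table above and adds a scaling-efficiency figure relative to C=1 (the efficiency metric is my addition, not part of the original sweep):

```python
# C -> aggregate decode tok/s, copied from the table above
decode = {1: 127.7, 8: 471.6, 32: 1078.9, 64: 1695.4, 128: 2800.2}

def per_request(agg_tok_s, concurrency):
    """Average decode speed seen by each of `concurrency` simultaneous requests."""
    return agg_tok_s / concurrency

def scaling_efficiency(agg_tok_s, concurrency, single_tok_s):
    """Fraction of ideal linear scaling relative to the single-request result."""
    return agg_tok_s / (concurrency * single_tok_s)

for c, agg in decode.items():
    print(f"C={c:<3} per-req={per_request(agg, c):6.1f} tok/s  "
          f"efficiency={scaling_efficiency(agg, c, decode[1]):.0%}")
```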
**Prefill (C=1):**
| ctx | TTFT | tok/s |
|-----|------|-------|
| 8K | 0.50s | 17,286 |
| 16K | 0.99s | 16,926 |
| 32K | 2.09s | 15,861 |
| 64K | 4.94s | 13,319 |
| 128K | 13.25s | 9,908 |
No speculative decoding — there's no NEXTN drafter for M2.7 yet. When one ships expect a meaningful jump at low concurrency.
Long-context cells skip at high concurrency (KV pool is ~83K tokens on bf16-KV TP=2). 16K is fine up to about C=8 per-req before queue contention kicks in; 128K is C=1-only territory.
Full methodology and caveats:
Thanks to Luke for the kernels + quant, and to Jon for the recent calibration data update on the M2.7 NVFP4 weights.
Following up on my previous Gemma 4 31B benchmark post, I tested speculative decoding with Gemma 4 E2B (4.65B) as the draft model.
The results were much better than I expected, so I wanted to share some controlled benchmark numbers.
Setup
- GPU: RTX 5090 (32GB VRAM)
- OS: Windows 11
- Main model: Gemma 4 31B UD-Q4_K_XL (18.3GB)
- Draft model: Gemma 4 E2B UD-Q4_K_XL (3.0GB)
- Backend: llama.cpp fork with TurboQuant KV cache (turbo3)
- Config: 128K context, parallel=1, Flash Attention, --draft-max 8 --draft-min 1
Benchmark Results
Same server config for both, max_tokens=500, temp=0.7, warm-up query discarded before measuring.
| Query Type | Baseline (t/s) | SpecDec (t/s) | Accept Rate | Speedup |
|---|---|---|---|---|
| Math explanation | 57.45 | 85.86 | 62.9% | +49.5% |
| Korean poetry | 56.93 | 62.34 | 44.1% | +9.5% |
| Code generation | 57.15 | 86.05 | 60.7% | +50.5% |
| Science explanation | 57.19 | 71.14 | 50.9% | +24.4% |
| Translation + analysis | 57.14 | 63.26 | 42.2% | +10.7% |
| Average | 57.17 | 73.73 | 52.2% | +29.0% |
Even at 42% acceptance rate, speculative decoding is still +10% faster because there's zero token translation overhead when the vocabs are compatible.
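If you want to reproduce the speedup column from your own runs, it is just the ratio of the two throughput numbers. A small sketch using the figures from the table (the dictionary layout and function names are mine):

```python
# query type -> (baseline t/s, speculative t/s), copied from the table above
results = {
    "Math explanation": (57.45, 85.86),
    "Korean poetry": (56.93, 62.34),
    "Code generation": (57.15, 86.05),
    "Science explanation": (57.19, 71.14),
    "Translation + analysis": (57.14, 63.26),
}

def speedup_pct(baseline, spec):
    """Relative throughput gain of speculative decoding over the baseline, in percent."""
    return (spec / baseline - 1.0) * 100.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

for name, (base, spec) in results.items():
    print(f"{name:<24} {speedup_pct(base, spec):+.1f}%")

avg_base = mean(b for b, _ in results.values())
avg_spec = mean(s for _, s in results.values())
print(f"{'Average':<24} {speedup_pct(avg_base, avg_spec):+.1f}%")
```

Because the table values are already rounded to two decimals, the recomputed percentages can differ from the published column in the last digit.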
The GGUF Version Trap
I initially got terrible results — the draft model was slower than no draft at all (7.31 t/s vs 57 t/s baseline). Every draft model combo gave this warning:
the target and draft vocabs are not compatible - tokens will be translated between the two
After digging into speculative.cpp, I found the compatibility check compares add_bos_token between target and draft. My 31B GGUF was from early April when Gemma 4 first dropped, and it had add_bos_token = false. The E2B model (downloaded later) had add_bos_token = true. This single metadata mismatch forced llama.cpp into token translation mode, killing all performance gains.
Re-downloading the 31B GGUF (Unsloth re-quantized all Gemma 4 GGUFs recently with the fix) made the warning disappear and unlocked the full +29% speedup.
TL;DR: If you downloaded your Gemma 4 GGUF in early April 2026, re-download it. The tokenizer metadata was fixed.
Practical Tips
Add these flags to your existing llama-server command:
-md gemma-4-E2B-it-UD-Q4_K_XL.gguf -ngld 99 --draft-max 8 --draft-min 1 --parallel 1
Things to watch out for:
- --parallel 1 is mandatory: with auto (=4), the draft model's KV cache is allocated 4x, eating VRAM and tanking speed to 7 t/s
- No vision: speculative decoding and multimodal can't be used together
- Q4 draft is fine: Q8 (4.8GB) doesn't improve speed over Q4 (3.0GB), and Q4 leaves more VRAM headroom
- Extra VRAM ~2.3GB: total ~23.4GB with 128K context on a 32GB card (256K fits too, ~25.5GB).
Content-dependent speedup
The gains scale with how predictable the output is:
- Code / Math (structured, repetitive patterns): ~60% accept rate → +50% speed
- Explanations (semi-structured): ~50% accept rate → +24%
- Creative / Translation (less predictable): ~42% accept rate → +10%
Even the worst case is still a net positive, which is the key difference from having incompatible vocabs where even 65% acceptance rate resulted in zero gains.
draft-max Sweep
Thanks to for the suggestion. Same benchmark setup, only varying --draft-max:
| draft-max | Math | Poetry | Code | Science | Translation | Avg (t/s) | vs baseline |
|---|---|---|---|---|---|---|---|
| baseline | 57.45 | 56.93 | 57.15 | 57.19 | 57.14 | 57.17 | — |
| 2 | 73.43 | 60.49 | 68.69 | 62.46 | 62.42 | 65.50 | +14.6% |
| 4 | 83.31 | 60.88 | 73.12 | 65.29 | 67.98 | 70.12 | +22.6% |
| 8 | 85.86 | 62.34 | 86.05 | 71.14 | 63.26 | 73.73 | +29.0% |
| 16 | 99.35 | 62.58 | 78.74 | 68.39 | 58.31 | 73.47 | +28.5% |
draft-max 8 is the sweet spot for mixed workloads. 16 pushes math to 99 t/s but regresses on creative/translation, ending up about the same average. Creative text stays flat (~62 t/s) regardless of draft-max — the bottleneck there is acceptance rate, not draft length.
I know it's been a while, but I'm trying to understand: is TurboQuant really revolutionary, or is it just another mediocre technology that has been overhyped by Google and Twitter?
With all the recent interest around Gemma 4 and local LLMs, I put together a small hands-on to explore building agent-style workflows locally.
Gemma 4 is getting surprisingly capable even in local setups, especially for reasoning and lightweight agent use cases.
I created a Colab notebook where you can try this end-to-end for free (no setup required):
Runs on local models (e.g. via Ollama) — no API costs.
The notebook walks through building a simple agentic workflow on top of a local model (Gemma 4), while keeping the control flow explicit and easy to reason about.
Under the hood, it uses a lightweight OSS workflow layer (similar to LangGraph, but focused on better developer experience), and works nicely alongside agent frameworks like ADK or PydanticAI for things like ReAct-style reasoning and tool use.
I recently open-sourced the framework under the hood (Apache 2.0 license):
Would love any feedback if you get a chance to try it.
Hello everyone,
I’ve been thinking and perusing Reddit lately and noticed that most people are using LLMs for agentic coding and such. I’m not much of a coder myself but I do need to have a personal assistant. I’ve had 4 strokes since 2016, I’m disabled and more or less home bound. I can’t get out and make friends, or even hang out with the friends I do have due to living in a small town apartment nearly 150 miles away from everyone.
So my question is, is anyone else building or has built a personal assistant using an LLM like I have? What does it do for you? How is it deployed? I’m genuinely curious. After spending nearly the last year and 2 months on building my LLMs memory system, I’m kinda curious what other people have built
I like the idea of the 395+ with 128 GB VRAM, but the inference speed with bigger models makes it seem like it's not worth it. I feel like if you ever need the capabilities of a bigger model, you can just use a cloud LLM for that.
Whereas with dual 3090s, you get a decent-sized model with lots of speed, which is far better for use cases such as agentic workflows.
What do you guys think?
I'm trying GLM 5.1, but is it just me, or does the thing really just work by over-cranking thinking to almost ridiculous heights?
For the last 20 minutes it has been writing novellas about what it is going to do, with all the "Uhm", "Actually wait", "but no...", when all I asked for was an owner-draw CButton with different colors.
Now don't get me wrong, at the end it seems to get there - but I'm just having my own "Actually wait" thinking moment:
Is this the way they made it so smart?
Meanwhile, other models like Claude (the $20 plan is now just a total test-mode ripoff: the tokens get spent in 15 minutes, then you wait for hours) or ChatGPT (I currently prefer codex over CC, honestly it feels as smart) simply give you the answer almost right away for such simple things.
Edit, 30 minutes and > 100k tokens and now it starts writing CThemedButtonCtrl
Edit 2: the code had errors (not horrible, basic mistakes, like accessing protected members directly, but still, errors)
Edit 3: It also means that while you can get "x" times more tokens for the price they offer, you are easily going to use "x" times more tokens this way. Right now I'm at 150k for simple stuff with GLM 5.1. Now I'm not trying to upsell cc or codex, I don't care, but we need to keep some perspective. 150k tokens in 30 min vs 15k-20k tokens in 2 min is a real difference and might not be "price smart". Of course ultimately we "can" run GLM 5.1 at home (well, I can't) but we can't run GPT or claude... so yeah, but...
Edit 4: the code is ok-ish, but requires more of my input to fix stuff. Thinking of teeth and gifted horse right now...
Edit5: LOL: "Actually, I just realized I'm overcomplicating this..."
English is not my first language. I wrote this in Chinese and translated it with AI help. The writing may have some AI flavor, but the design decisions, the production failures, and the thinking that distilled them into principles — those are mine.
I was a backend lead at Manus before the Meta acquisition. I've spent the last 2 years building AI agents — first at Manus, then on my own open-source agent runtime () and agent (). Along the way I came to a conclusion that surprised me:
A single run(command="...") tool with Unix-style commands outperforms a catalog of typed function calls.
Here's what I learned.
Why *nix
Unix made a design decision 50 years ago: everything is a text stream. Programs don't exchange complex binary structures or share memory objects — they communicate through text pipes. Small tools each do one thing well, composed via | into powerful workflows. Programs describe themselves with --help, report success or failure with exit codes, and communicate errors through stderr.
LLMs made an almost identical decision 50 years later: everything is tokens. They only understand text, only produce text. Their "thinking" is text, their "actions" are text, and the feedback they receive from the world must be text.
These two decisions, made half a century apart from completely different starting points, converge on the same interface model. The text-based system Unix designed for human terminal operators — cat, grep, pipe, exit codes, man pages — isn't just "usable" by LLMs. It's a natural fit. When it comes to tool use, an LLM is essentially a terminal operator — one that's faster than any human and has already seen vast amounts of shell commands and CLI patterns in its training data.
This is the core philosophy of the *nix Agent: don't invent a new tool interface. Take what Unix has proven over 50 years and hand it directly to the LLM.
Why a single run
The single-tool hypothesis
Most agent frameworks give LLMs a catalog of independent tools:
tools: [search_web, read_file, write_file, run_code, send_email, ...]
Before each call, the LLM must make a tool selection — which one? What parameters? The more tools you add, the harder the selection, and accuracy drops. Cognitive load is spent on "which tool?" instead of "what do I need to accomplish?"
My approach: one run(command="...") tool, all capabilities exposed as CLI commands.
run(command="cat notes.md")
run(command="cat log.txt | grep ERROR | wc -l")
run(command="see screenshot.png")
run(command="memory search 'deployment issue'")
run(command="clip sandbox bash 'python3 analyze.py'")
The LLM still chooses which command to use, but this is fundamentally different from choosing among 15 tools with different schemas. Command selection is string composition within a unified namespace — function selection is context-switching between unrelated APIs.
LLMs already speak CLI
Why are CLI commands a better fit for LLMs than structured function calls?
Because CLI is the densest tool-use pattern in LLM training data. Billions of lines on GitHub are full of:
# README install instructions
pip install -r requirements.txt && python main.py

# CI/CD build scripts
make build && make test && make deploy

# Stack Overflow solutions
cat /var/log/syslog | grep "Out of memory" | tail -20
I don't need to teach the LLM how to use CLI — it already knows. This familiarity is probabilistic and model-dependent, but in practice it's remarkably reliable across mainstream models.
Compare two approaches to the same task:
Task: Read a log file, count the error lines

Function-calling approach (3 tool calls):
1. read_file(path="/var/log/app.log") → returns entire file
2. search_text(text=<entire file>, pattern="ERROR") → returns matching lines
3. count_lines(text=<matched lines>) → returns number

CLI approach (1 tool call):
run(command="cat /var/log/app.log | grep ERROR | wc -l") → "42"
One call replaces three. Not because of special optimization — but because Unix pipes natively support composition.
Making pipes and chains work
A single run isn't enough on its own. If run can only execute one command at a time, the LLM still needs multiple calls for composed tasks. So I built a chain parser (parseChain) into the command routing layer, supporting four Unix operators:
|   Pipe: stdout of previous command becomes stdin of next
&&  And: execute next only if previous succeeded
||  Or: execute next only if previous failed
;   Seq: execute next regardless of previous result
With this mechanism, every tool call can be a complete workflow:
# One tool call: download → inspect
curl -sL $URL -o data.csv && cat data.csv | head 5

# One tool call: read → filter → sort → top 10
cat access.log | grep "500" | sort | head 10

# One tool call: try A, fall back to B
cat config.yaml || echo "config not found, using defaults"
N commands × 4 operators — the composition space grows dramatically. And to the LLM, it's just a string it already knows how to write.
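The post names parseChain but does not show it, so here is a rough Python sketch of how such a chain parser and executor could work, assuming commands are in-process handlers that take (args, stdin) and return (stdout, exit_code). Everything here, including `parse_chain`, `run_chain`, and the demo commands, is hypothetical and simplified: a quoted operator such as "&&" is not distinguished from a real one, and stderr is not modeled.

```python
import shlex

OPERATORS = {"|", "&&", "||", ";"}

def parse_chain(command):
    """Split a command string into [(operator, argv), ...]; the first operator is None."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True  # split on whitespace and on runs of punctuation
    segments, op, argv = [], None, []
    for token in lexer:
        if token in OPERATORS:
            if not argv:
                raise ValueError(f"unexpected operator {token!r}")
            segments.append((op, argv))
            op, argv = token, []
        else:
            argv.append(token)
    if not argv:
        raise ValueError("empty command or trailing operator")
    segments.append((op, argv))
    return segments

def run_chain(command, registry):
    """Execute a chain and return (combined stdout, exit code of the last command run)."""
    finished, current, code, skipped = [], "", 0, False
    for op, argv in parse_chain(command):
        if op == "|":
            if skipped:            # the pipeline this segment belongs to was skipped
                continue
            stdin = current        # previous stdout becomes this command's stdin
        else:
            skipped = (op == "&&" and code != 0) or (op == "||" and code == 0)
            if skipped:
                continue
            finished.append(current)  # previous pipeline is complete, keep its output
            stdin = ""
        handler = registry.get(argv[0])
        if handler is None:
            current, code = f"[error] unknown command: {argv[0]}\n", 127
        else:
            current, code = handler(argv[1:], stdin)
    finished.append(current)
    return "".join(finished), code

# Tiny demo commands (stand-ins for real ones)
def echo(args, stdin):
    return " ".join(args) + "\n", 0

def grep(args, stdin):
    hits = [line for line in stdin.splitlines() if args[0] in line]
    return "".join(h + "\n" for h in hits), (0 if hits else 1)

def count(args, stdin):
    return f"{len(stdin.splitlines())}\n", 0

DEMO = {"echo": echo, "grep": grep, "count": count}
print(run_chain("echo ERROR disk full | grep ERROR | count", DEMO))
print(run_chain("nope || echo fallback", DEMO))
```

Using `shlex` with `punctuation_chars=True` gets quoting and operator tokenization mostly for free; a production version would still need to track which tokens were quoted and decide how stderr flows through the chain.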
The command line is the LLM's native tool interface.
Heuristic design: making CLI guide the agent
Single-tool + CLI solves "what to use." But the agent still needs to know "how to use it." It can't Google. It can't ask a colleague. I use three progressive design techniques to make the CLI itself serve as the agent's navigation system.
Technique 1: Progressive --help discovery
A well-designed CLI tool doesn't require reading documentation — because --help tells you everything. I apply the same principle to the agent, structured as progressive disclosure: the agent doesn't need to load all documentation at once, but discovers details on-demand as it goes deeper.
Level 0: Tool Description → command list injection
The run tool's description is dynamically generated at the start of each conversation, listing all registered commands with one-line summaries:
Available commands:
cat - Read a text file. For images use 'see'. For binary use 'cat -b'.
see - View an image (auto-attaches to vision)
ls - List files in current topic
write - Write file. Usage: write <path> [content] or stdin
grep - Filter lines matching a pattern (supports -i, -v, -c)
memory - Search or manage memory
clip - Operate external environments (sandboxes, services)
...
The agent knows what's available from turn one, but doesn't need every parameter of every command — that would waste context.
Note: There's an open design question here: injecting the full command list vs. on-demand discovery. As commands grow, the list itself consumes context budget. I'm still exploring the right balance. Ideas welcome.
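As a rough illustration of Level 0, the tool description can be generated from whatever command registry exists at conversation start. This sketch assumes a plain dict of name to one-line summary; the `COMMANDS` dict and `build_tool_description` are hypothetical names, and only three of the commands are included to keep it short.

```python
# name -> one-line summary (a subset of the command list shown above)
COMMANDS = {
    "cat": "Read a text file. For images use 'see'. For binary use 'cat -b'.",
    "see": "View an image (auto-attaches to vision)",
    "ls": "List files in current topic",
}

def build_tool_description(commands):
    """Render the registry as the text injected into the run tool's description."""
    width = max(len(name) for name in commands)  # align summaries in one column
    lines = ["Available commands:"]
    lines += [f"  {name.ljust(width)} - {summary}" for name, summary in commands.items()]
    return "\n".join(lines)

print(build_tool_description(COMMANDS))
```

Since the list is built from the registry, its context cost grows with the number of commands, which is exactly the trade-off raised in the note above.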
Level 1: command (no args) → usage
When the agent is interested in a command, it just calls it. No arguments? The command returns its own usage:
→ run(command="memory")
[error] memory: usage: memory search|recent|store|facts|forget

→ run(command="clip")
clip list - list available clips
clip <name> - show clip details and commands
clip <name> <command> [args...] - invoke a command
clip <name> pull <remote-path> [name] - pull file from clip to local
clip <name> push <local-path> <remote> - push local file to clip
Now the agent knows memory has five subcommands and clip supports list/pull/push. One call, no noise.
Level 2: command subcommand (missing args) → specific parameters
The agent decides to use memory search but isn't sure about the format? It drills down:
→ run(command="memory search")
[error] memory: usage: memory search <query> [-t topic_id] [-k keyword]
→ run(command="clip sandbox")
Clip: sandbox
Commands:
clip sandbox bash <script>
clip sandbox read <path>
clip sandbox write <path>
File transfer:
clip sandbox pull <remote-path> [local-name]
clip sandbox push <local-path> <remote-path>

Progressive disclosure: overview (injected) → usage (explored) → parameters (drilled down). The agent discovers on-demand, each level providing just enough information for the next step.
This is fundamentally different from stuffing 3,000 words of tool documentation into the system prompt. Most of that information is irrelevant most of the time — pure context waste. Progressive help lets the agent decide when it needs more.
This also imposes a requirement on command design: every command and subcommand must have complete help output. It's not just for humans — it's for the agent. A good help message means one-shot success. A missing one means a blind guess.
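A minimal sketch of Levels 1 and 2 for one command, assuming handlers return an (output, exit_code) pair. The usage strings are the ones quoted above; the handler itself, the exit code of 1, and the placeholder search result are invented for illustration.

```python
USAGE = {
    ("memory",): "memory search|recent|store|facts|forget",
    ("memory", "search"): "memory search <query> [-t topic_id] [-k keyword]",
}

def memory(args):
    """Return (output, exit_code); missing arguments yield usage instead of a bare failure."""
    if not args:
        # Level 1: bare command, list the subcommands
        return f"[error] memory: usage: {USAGE[('memory',)]}", 1
    sub, rest = args[0], args[1:]
    if sub == "search":
        if not rest:
            # Level 2: subcommand without arguments, show its parameters
            return f"[error] memory: usage: {USAGE[('memory', 'search')]}", 1
        return f"(searching memory for {' '.join(rest)!r})", 0  # placeholder for a real lookup
    return f"[error] memory: unknown subcommand {sub!r}. usage: {USAGE[('memory',)]}", 1

print(memory([])[0])
print(memory(["search"])[0])
```

Keeping the usage strings in one table makes it easy to check that every command and subcommand actually has help output.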
Technique 2: Error messages as navigation
Agents will make mistakes. The key isn't preventing errors — it's making every error point to the right direction.
Traditional CLI errors are designed for humans who can Google. Agents can't Google. So I require every error to contain both "what went wrong" and "what to do instead":
Traditional CLI:
$ cat photo.png
cat: binary file (standard output)
→ Human Googles "how to view image in terminal"

My design:
[error] cat: binary image file (182KB). Use: see photo.png
→ Agent calls see directly, one-step correction
More examples:
[error] unknown command: foo
Available: cat, ls, see, write, grep, memory, clip, ...
→ Agent immediately knows what commands exist

[error] not an image file: data.csv (use cat to read text files)
→ Agent switches from see to cat

[error] clip "sandbox" not found. Use 'clip list' to see available clips
→ Agent knows to list clips first
Technique 1 (help) solves "what can I do?" Technique 2 (errors) solves "what should I do instead?" Together, the agent's recovery cost is minimal — usually 1-2 steps to the right path.
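Here is one way the "what went wrong plus what to do instead" rule could look in code. This is a sketch, not the author's implementation: the extension list, the null-byte check, the exit codes, and the "Use: ls" hint for a missing file are my assumptions; the message wording for images and unknown commands follows the examples above.

```python
import os

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}  # assumed list

def cat(path):
    """Read a text file, or return an error that names the command to use instead."""
    if not os.path.exists(path):
        return f"[error] cat: {path}: no such file. Use: ls", 1
    size_kb = os.path.getsize(path) // 1024
    if os.path.splitext(path)[1].lower() in IMAGE_EXTS:
        return f"[error] cat: binary image file ({size_kb}KB). Use: see {path}", 1
    with open(path, "rb") as f:
        data = f.read()
    if b"\x00" in data:  # crude binary detection
        return f"[error] cat: binary file ({size_kb}KB). Use: cat -b {path}", 1
    return data.decode("utf-8", errors="replace"), 0

def unknown_command(name, available):
    """Error for an unregistered command that also lists what does exist."""
    return f"[error] unknown command: {name}\nAvailable: {', '.join(available)}", 127

print(unknown_command("foo", ["cat", "ls", "see"])[0])
```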
Real case: The cost of silent stderr
For a while, my code silently dropped stderr when calling external sandboxes — whenever stdout was non-empty, stderr was discarded. The agent ran pip install pymupdf, got exit code 127. stderr contained bash: pip: command not found, but the agent couldn't see it. It only knew "it failed," not "why" — and proceeded to blindly guess 10 different package managers:
pip install → 127 (doesn't exist)
python3 -m pip → 1 (module not found)
uv pip install → 1 (wrong usage)
pip3 install → 127
sudo apt install → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓ (10th try)
10 calls, ~5 seconds of inference each. If stderr had been visible the first time, one call would have been enough.
stderr is the information agents need most, precisely when commands fail. Never drop it.
Technique 3: Consistent output format
The first two techniques handle discovery and correction. The third lets the agent get better at using the system over time.
I append consistent metadata to every tool result:
file1.txt
file2.txt
dir1/
[exit:0 | 12ms]
The LLM extracts two signals:
Exit codes (Unix convention, LLMs already know these):
- `exit:0`: success
- `exit:1`: general error
- `exit:127`: command not found
Duration (cost awareness):
- `12ms`: cheap, call freely
- `3.2s`: moderate
- `45s`: expensive, use sparingly
After seeing [exit:N | Xs] dozens of times in a conversation, the agent internalizes the pattern. It starts anticipating — seeing exit:1 means check the error, seeing long duration means reduce calls.
Consistent output format makes the agent smarter over time. Inconsistency makes every call feel like the first.
The three techniques form a progression:
--help     → "What can I do?"    → Proactive discovery
Error Msg  → "What should I do?" → Reactive correction
Output Fmt → "How did it go?"    → Continuous learning
Two-layer architecture: engineering the heuristic design
The section above described how CLI guides agents at the semantic level. But to make it work in practice, there's an engineering problem: the raw output of a command and what the LLM needs to see are often very different things.
Two hard constraints of LLMs
Constraint A: The context window is finite and expensive. Every token costs money, attention, and inference speed. Stuffing a 10MB file into context doesn't just waste budget — it pushes earlier conversation out of the window. The agent "forgets."
Constraint B: LLMs can only process text. Binary data produces high-entropy meaningless tokens through the tokenizer. It doesn't just waste context — it disrupts attention on surrounding valid tokens, degrading reasoning quality.
These two constraints mean: raw command output can't go directly to the LLM — it needs a presentation layer for processing. But that processing can't affect command execution logic — or pipes break. Hence, two layers.
Execution layer vs. presentation layer
┌─────────────────────────────────────────────┐
│ Layer 2: LLM Presentation Layer             │ ← Designed for LLM constraints
│ Binary guard | Truncation+overflow | Meta   │
├─────────────────────────────────────────────┤
│ Layer 1: Unix Execution Layer               │ ← Pure Unix semantics
│ Command routing | pipe | chain | exit code  │
└─────────────────────────────────────────────┘
When cat bigfile.txt | grep error | head 10 executes:
Inside Layer 1:
cat output → [500KB raw text] → grep input
grep output → [matching lines] → head input
head output → [first 10 lines]
If you truncate cat's output in Layer 1 → grep only searches the first 200 lines, producing incomplete results. If you add [exit:0] in Layer 1 → it flows into grep as data, becoming a search target.
So Layer 1 must remain raw, lossless, metadata-free. Processing only happens in Layer 2 — after the pipe chain completes and the final result is ready to return to the LLM.
Layer 1 serves Unix semantics. Layer 2 serves LLM cognition. The separation isn't a design preference — it's a logical necessity.
Layer 2's four mechanisms
Mechanism A: Binary Guard (addressing Constraint B)
Before returning anything to the LLM, check if it's text:
Null byte detected → binary
UTF-8 validation failed → binary
Control character ratio > 10% → binary

If image: [error] binary image (182KB). Use: see photo.png
If other: [error] binary file (1.2MB). Use: cat -b file.bin
The LLM never receives data it can't process.
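The three checks can be sketched in Python roughly as follows. This is an illustration, not the project's `isBinary()` from `internal/fs.go`. The names `is_binary` and `guard` are hypothetical, and which characters count as "control" (everything below 0x20 except tab, newline and carriage return) is my assumption.

```python
def is_binary(data: bytes, max_control_ratio: float = 0.10) -> bool:
    """Apply the checks in order: null byte, UTF-8 validity, control-character ratio."""
    if not data:
        return False
    if b"\x00" in data:
        return True
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    # Assumption: tab, newline and carriage return are normal in text output.
    control = sum(1 for ch in text if ord(ch) < 32 and ch not in "\t\n\r")
    return control / len(text) > max_control_ratio


def guard(data: bytes, name: str) -> str:
    """Return text unchanged, or an error telling the agent what to run instead."""
    if not is_binary(data):
        return data.decode("utf-8")
    return f"[error] binary file ({len(data) / 1024:.0f}KB). Use: cat -b {name}"


print(guard(b"\x89PNG\r\n\x1a\n" + bytes(2040), "file.bin"))
```

Telling images apart from other binaries (to suggest `see` rather than `cat -b`) would need an extra check such as a magic-number lookup, which is left out here.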
Mechanism B: Overflow Mode (addressing Constraint A)
Output > 200 lines or > 50KB?
→ Truncate to first 200 lines (rune-safe, won't split UTF-8)
→ Write full output to /tmp/cmd-output/cmd-{n}.txt
→ Return to LLM:
[first 200 lines]
--- output truncated (5000 lines, 245.3KB) ---
Full output: /tmp/cmd-output/cmd-3.txt
Explore: cat /tmp/cmd-output/cmd-3.txt | grep <pattern>
cat /tmp/cmd-output/cmd-3.txt | tail 100
[exit:0 | 1.2s]
Key insight: the LLM already knows how to use grep, head, tail to navigate files. Overflow mode transforms "large data exploration" into a skill the LLM already has.
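A rough Python sketch of overflow mode, assuming the 200-line / 50KB thresholds stated above. The function name `overflow` and its parameters are hypothetical, and the output directory is passed in rather than hard-coded to `/tmp/cmd-output`.

```python
import os
import tempfile

MAX_LINES = 200
MAX_BYTES = 50 * 1024


def overflow(output: str, n: int, out_dir: str) -> str:
    """Return small output unchanged; otherwise save it and return a truncated view."""
    lines = output.splitlines()
    size = len(output.encode("utf-8"))
    if len(lines) <= MAX_LINES and size <= MAX_BYTES:
        return output
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"cmd-{n}.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(output)
    head = "\n".join(lines[:MAX_LINES])
    # Byte cap that never leaves half a character: a cut-off trailing sequence is dropped.
    head = head.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
    return (
        f"{head}\n"
        f"--- output truncated ({len(lines)} lines, {size / 1024:.1f}KB) ---\n"
        f"Full output: {path}\n"
        f"Explore: cat {path} | grep <pattern>\n"
        f"         cat {path} | tail 100"
    )


log = "\n".join(f"line {i}" for i in range(5000))
view = overflow(log, 3, tempfile.mkdtemp())
print(view.splitlines()[200])
```

The metadata footer is deliberately not added here; that stays a separate step so the saved file contains only the raw output.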
Mechanism C: Metadata Footer
actual output here
[exit:0 | 1.2s]
Exit code + duration, appended as the last line of Layer 2. Gives the agent signals for success/failure and cost awareness, without polluting Layer 1's pipe data.
Mechanism D: stderr Attachment
When command fails with stderr: output + "\n[stderr] " + stderr

Ensures the agent can see why something failed, preventing blind retries.
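Mechanisms C and D together could look like this in Python. The function `present` is a hypothetical name, and the rule for switching between `ms` and `s` in the duration is my assumption.

```python
def present(stdout: str, stderr: str, exit_code: int, seconds: float) -> str:
    """Build the text the LLM sees once the whole pipe chain has finished."""
    body = stdout
    # Mechanism D: attach stderr whenever the command failed.
    if exit_code != 0 and stderr:
        body += "\n[stderr] " + stderr
    # Mechanism C: the footer is always the last line.
    duration = f"{seconds * 1000:.0f}ms" if seconds < 1 else f"{seconds:.1f}s"
    return f"{body}\n[exit:{exit_code} | {duration}]"


print(present("", "bash: pip: command not found", 127, 0.012))
```

Because this runs only on the final result, nothing it adds can leak into a pipe as data.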
Lessons learned: stories from production
Story 1: A PNG that caused 20 iterations of thrashing
A user uploaded an architecture diagram. The agent read it with cat, receiving 182KB of raw PNG bytes. The LLM's tokenizer turned these bytes into thousands of meaningless tokens crammed into the context. The LLM couldn't make sense of it and started trying different read approaches — cat -f, cat --format, cat --type image — each time receiving the same garbage. After 20 iterations, the process was force-terminated.
Root cause: cat had no binary detection, Layer 2 had no guard.
Fix: isBinary() guard + error guidance Use: see photo.png.
Lesson: The tool result is the agent's eyes. Return garbage = agent goes blind.
Story 2: Silent stderr and 10 blind retries
The agent needed to read a PDF. It tried pip install pymupdf, got exit code 127. stderr contained bash: pip: command not found, but the code dropped it — because there was some stdout output, and the logic was "if stdout exists, ignore stderr."
The agent only knew "it failed," not "why." What followed was a long trial-and-error:
pip install → 127 (doesn't exist)
python3 -m pip → 1 (module not found)
uv pip install → 1 (wrong usage)
pip3 install → 127
sudo apt install → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓
10 calls, ~5 seconds of inference each. If stderr had been visible the first time, one call would have sufficed.
Root cause: InvokeClip silently dropped stderr when stdout was non-empty.
Fix: Always attach stderr on failure.
Lesson: stderr is the information agents need most, precisely when commands fail.
Story 3: The value of overflow mode
The agent analyzed a 5,000-line log file. Without truncation, the full text (~200KB) was stuffed into context. The LLM's attention was overwhelmed, response quality dropped sharply, and earlier conversation was pushed out of the context window.
With overflow mode:
[first 200 lines of log content]
--- output truncated (5000 lines, 198.5KB) ---
Full output: /tmp/cmd-output/cmd-3.txt
Explore: cat /tmp/cmd-output/cmd-3.txt | grep <pattern>
cat /tmp/cmd-output/cmd-3.txt | tail 100
[exit:0 | 45ms]
The agent saw the first 200 lines, understood the file structure, then used grep to pinpoint the issue — 3 calls total, under 2KB of context.
Lesson: Giving the agent a "map" is far more effective than giving it the entire territory.
Boundaries and limitations
CLI isn't a silver bullet. Typed APIs may be the better choice in these scenarios:
- Strongly-typed interactions: Database queries, GraphQL APIs, and other cases requiring structured input/output. Schema validation is more reliable than string parsing.
- High-security requirements: CLI's string concatenation carries inherent injection risks. In untrusted-input scenarios, typed parameters are safer. agent-clip mitigates this through sandbox isolation.
- Native multimodal: Pure audio/video processing and other binary-stream scenarios where CLI's text pipe is a bottleneck.
Additionally, "no iteration limit" doesn't mean "no safety boundaries." Safety is ensured by external mechanisms:
- Sandbox isolation: Commands execute inside BoxLite containers, no escape possible
- API budgets: LLM calls have account-level spending caps
- User cancellation: Frontend provides cancel buttons, backend supports graceful shutdown
Hand Unix philosophy to the execution layer, hand LLM's cognitive constraints to the presentation layer, and use help, error messages, and output format as three progressive heuristic navigation techniques.
CLI is all agents need.
Source code (Go):
Core files: internal/tools.go (command routing), internal/chain.go (pipes), internal/loop.go (two-layer agentic loop), internal/fs.go (binary guard), internal/clip.go (stderr handling), internal/browser.go (vision auto-attach), internal/memory.go (semantic memory).
Happy to discuss — especially if you've tried similar approaches or found cases where CLI breaks down. The command discovery problem (how much to inject vs. let the agent discover) is something I'm still actively exploring.
Hey, I’m back with another one from the pile of model behaviors I’ve been trying to isolate and turn into trainable dataset slices.
This time the problem is reliable JSON extraction from financial-style documents.
I keep seeing the same pattern:
You can prompt a smaller/open model hard enough that it looks good in a demo.
It gives you JSON.
It extracts the right fields.
You think you’re close.
That’s the part that keeps making me think this is not just a prompt problem.
It feels more like a training problem.
A lot of what I’m building right now is around this idea that model quality should be broken into very narrow behaviors and trained directly, instead of hoping a big prompt can hold everything together.
For this one, the behavior is basically:
Can the model stay schema-first, even when the input gets messy?
Not just:
“can it produce JSON once?”
But:
- can it keep the same structure every time
- can it make success and failure outputs equally predictable
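To make "equally predictable" concrete, here is a small Python sketch of a checker that accepts a model reply only if it matches one fixed envelope, and reports failures in that same shape. The envelope keys (`ok`, `data`, `error`) and the `invoice_total` field are hypothetical, not taken from my dataset rows.

```python
import json

REQUIRED_KEYS = {"ok", "data", "error"}  # hypothetical envelope


def parse_extraction(raw: str) -> dict:
    """Return the reply if it matches the envelope, else a failure with the same keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "data": None, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        return {"ok": False, "data": None, "error": "unexpected keys"}
    return obj


good = parse_extraction('{"ok": true, "data": {"invoice_total": "120.50"}, "error": null}')
bad = parse_extraction("Sure! Here is the JSON you asked for:")
print(sorted(good) == sorted(bad))
```

Downstream code can then branch on `ok` alone, whatever the model actually produced.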
One of the row patterns I’ve been looking at has this kind of training signal built into it:
{
"sample_id": "lane_16_code_json_spec_mode_en_00000001",
"assistant_response": "Design notes: - Storage: a local JSON file with explicit load and save steps. - Bad: vague return values. Good: consistent shapes for success and failure."
}
What I like about this kind of row is that it does not just show the model a format.
It teaches the rule:
- vague output is bad
- stable structured output is good
That feels especially relevant for stuff like:
- financial statement extraction
- invoice parsing
So this is one of the slices I’m working on right now while building out behavior-specific training data.
Curious how other people here think about this.
They range from Q1 to BF16.
Grab them while they're still hot over at
Thanks to !
Here's the current list:
| Bits | Quantization Label | Size |
|---|---|---|
| 1-bit | UD-IQ1_M | 60.7 GB |
| 2-bit | UD-IQ2_XXS | 65.4 GB |
| | UD-IQ2_M | 70.1 GB |
| | UD-Q2_K_XL | 75.3 GB |
| 3-bit | UD-IQ3_XXS | 80.1 GB |
| | UD-IQ3_S | 83.6 GB |
| | UD-Q3_K_S | 93.6 GB |
| | UD-Q3_K_M | 101 GB |
| | UD-Q3_K_XL | 102 GB |
| 4-bit | UD-IQ4_XS | 108 GB |
| | UD-IQ4_NL | 111 GB |
| | UD-Q4_K_S | 131 GB |
| | MXFP4_MOE | 136 GB |
| | UD-Q4_K_M | 140 GB |
| | UD-Q4_K_XL | 141 GB |
| 5-bit | UD-Q5_K_S | 159 GB |
| | UD-Q5_K_M | 169 GB |
| | UD-Q5_K_XL | 169 GB |
| 6-bit | UD-Q6_K | 188 GB |
| | UD-Q6_K_XL | 207 GB |
| 8-bit | Q8_0 | 243 GB |
| | UD-Q8_K_XL | 247 GB |
| 16-bit | BF16 | 457 GB |
I got a MacBook Pro M4 Pro 24GB Unified RAM
I was wondering if anybody here uses local LLM models as their second brain director for Obsidian.
- Summarise notes
- Link notes
- Tag notes
- Going deeper into the notes
- etc
But my main goal with this is to use a local model to refer to my vault as a RAG pipeline.
I’ve only recently begun testing which specific model would be good for this on my specs. Any suggestions?
TurboQuant has been all the rage in the past two days, and I've seen lots of comments here attempting to explain the magic behind it. Many of those comments boil down to "dude, it's polar coordinates!!!", and that's really misleading. The most important part has nothing to do with polar coordinates (although they are emphasized in Google's blog post, so the confusion is understandable).
TurboQuant is a vector quantization algorithm. It turns a vector of numbers into another vector of numbers that takes up less memory.
Quantization is a fairly basic operation. If you have an n-dimensional vector that looks like this:
0.2374623
0.7237428
0.5434738
0.1001233
...
Then a quantized version of that vector may look like this:
0.237
0.723
0.543
0.100
...
Notice how I simply shaved off the last four digits of each number? That's already an example of a crude quantization process. Obviously, there are far more sophisticated schemes, including grouping coefficients in blocks, adaptive thresholds, calibrated precision based on experimental data etc., but at its core, quantization always involves reducing coefficient precision.
Here is the key idea behind TurboQuant: Before quantizing a vector, we randomly rotate it in the n-dimensional space it resides in. The corresponding counter-rotation is applied during dequantization.
That's it.
Now you probably feel that I must have left out an important detail. Surely the rotation can't be completely random? Maybe it's sampled from a particular distribution, or somehow input-dependent? Or perhaps there is another operation that goes hand in hand with it?
Nope. I didn't leave anything out. Just applying a random rotation to the vector dramatically improves quantization performance.
But why?
Because the magnitudes of the coefficients of state vectors in language models aren't distributed uniformly among the vector dimensions. It's very common to see vectors that look like this:
0.0000023
0.9999428 <-- !!!
0.0000738
0.0000003
...
This phenomenon has many names, and it shows up everywhere in transformer research. You can read about "massive activations" and "attention sinks" for a deeper analysis.
What matters for the purposes of this explanation is: Vectors with this type of quasi-sparse structure are terrible targets for component quantization. Reducing precision in such a vector effectively turns the massive component into 1 (assuming the vector is normalized), and all other components into 0. That is, quantization "snaps" the vector to its nearest cardinal direction. This collapses the information content of the vector, as identifying a cardinal direction takes only log2(2n) bits, whereas the quantized vector can hold kn bits (assuming k bits per component).
And that's where the random rotation comes in! Since most directions aren't near a cardinal direction (and this only becomes more true as the number of dimensions increases), a random rotation almost surely results in a vector that distributes the coefficient weight evenly across all components, meaning that quantization doesn't cause information loss beyond that expected from precision reduction.
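Here is a small pure-Python experiment that illustrates the effect. It is a sketch under my own assumptions, not TurboQuant itself. The quantizer is a plain 4-bit absmax scalar quantizer, and the random orthogonal matrix comes from Gram-Schmidt on Gaussian rows, so it may include a reflection. The test vector uses a milder outlier than the example above (one component of 10 among unit-variance components), so that the small components still carry information worth preserving.

```python
import math
import random


def random_rotation(n, rng):
    """Random orthogonal matrix via modified Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def quantize(x, bits):
    """Symmetric absmax quantizer: snap each component to a signed grid, then dequantize."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(a) for a in x) or 1.0
    return [round(a / scale * levels) * scale / levels for a in x]


def sq_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


rng = random.Random(0)
n, bits = 64, 4
vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
vec[1] = 10.0  # one massive component

# Quantize directly: the outlier sets the scale, so the grid is coarse for everything else.
err_plain = sq_error(vec, quantize(vec, bits))

# Rotate, quantize, then apply the counter-rotation (the transpose).
rot = random_rotation(n, rng)
restored = matvec(transpose(rot), quantize(matvec(rot, vec), bits))
err_rotated = sq_error(vec, restored)

print(err_plain, err_rotated)
```

In this setup the rotated version should show the smaller reconstruction error, because after rotation no single component dominates the scale.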
The TurboQuant paper proves this mathematically, and gives an exact description of the distribution behavior, but the intuitive understanding is much more straightforward than that.
This idea isn't new (RaBitQ employs the same trick, and QuIP a similar one), but TurboQuant combines it with a second step that eliminates biases that arise when quantized vectors that are optimal in a certain sense (MSE) are used to compute inner products, which is what happens in attention blocks. See the paper if you're interested in the details.
Been using ClaudeCode CLI with Opus 4.6 and many MCPs, and honestly it's addicting.
Just tell it what to build and it does everything — reads the codebase, writes code, runs commands, fixes its own errors. Pure vibe coding.
Now I want the same thing but with Qwen3-Coder-next running locally.
Not copilot autocomplete stuff, I mean the full "build me this feature" autonomous agent experience.
Looked into Cline, Aider, Open Interpreter so far. Cline seems closest but curious what you all are actually using day to day.
Anyone running a solid agentic setup with local models? What's working, what's not? And which one is the best?
I've noticed an issue when I'm using Pi as a coding agent with llama-cpp, and I'm wondering if there's an issue with Pi or how I have it configured, or if this is just expected behavior.
I'm using Qwen3.5 122b with thinking enabled. When doing a bunch of agentic edits, it will do a lot of interleaving thinking and tool calls. This all works fine.
But then when it comes to my next turn providing input, I get a whole bunch of the context cache invalidated, because it looks like Pi is no longer sending over the thinking blocks. I see this in the llama-cpp log, where you can see that it diverged by dropping the thinking block:
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.736 (> 0.100 thold), f_keep = 0.703
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 29044 | processing task, is_child = 0
slot update_slots: id 3 | task 29044 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 48112
slot update_slots: id 3 | task 29044 | old: ... <|im_start|>assistant | <think> The user is saying
slot update_slots: id 3 | task 29044 | new: ... <|im_start|>assistant | You're right - ball-to
slot update_slots: id 3 | task 29044 | 198 248045 74455 198 248068 198 760 1156 369 5315
slot update_slots: id 3 | task 29044 | 198 248045 74455 198 2523 2224 1245 471 4776 4534
slot update_slots: id 3 | task 29044 | n_past = 35407, slot.prompt.tokens.size() = 50377, seq_id = 3, pos_min = 50376, n_swa = 0
And then it goes on to invalidate a bunch of the context checkpoints and recomputes the cache from the point where the history diverged, which is where the thinking context was dropped.
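To sanity-check my reading of the log, here is a small Python sketch of the longest-common-prefix comparison, using the token IDs and counts from the log. My assumption, based only on the numbers matching, is that `sim_best` is the shared prefix divided by the new prompt length and `f_keep` is the shared prefix divided by the cached length. I have not verified this against the llama.cpp source.

```python
def common_prefix_len(a, b):
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Token IDs copied from the two update_slots lines in the log.
old = [198, 248045, 74455, 198, 248068, 198, 760, 1156, 369, 5315]
new = [198, 248045, 74455, 198, 2523, 2224, 1245, 471, 4776, 4534]
print(common_prefix_len(old, new))  # 4

# Values from the log: n_past, task.n_tokens, slot.prompt.tokens.size()
n_past, n_new, n_cached = 35407, 48112, 50377
print(round(n_past / n_new, 3), round(n_past / n_cached, 3))  # 0.736 0.703
```

The two windows diverge right after the fourth token, where the old sequence has 248068 and the new one does not, which fits the `<think>` block being dropped. Everything after position 35407 has to be recomputed.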
Now, I haven't dug into this too deeply yet, but I wanted to check: is this behavior expected? Do I have something configured wrong, or is Pi buggy in not sending thinking context from previous turns?
Here's the model config from my models.json in my Pi config:
{
"id": "unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL",
"name": "Qwen3.5 122B-A10B (local)",
"reasoning": true,
"input": ["text", "image"],
"contextWindow": 262144,
"maxTokens": 65536,
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
"compat": {
"thinkingFormat": "qwen-chat-template"
}
},
We just open-sourced MOSS-TTS-Nano, a tiny multilingual speech generation model from and the OpenMOSS team.
Some highlights:
- 0.1B parameters
- Realtime speech generation
- Runs on CPU without requiring a GPU
- Multilingual support (Chinese, English, Japanese, Korean, Arabic, and more)
- Streaming inference
- Long-text voice cloning
- Simple local deployment with , , and CLI commands
The project is aimed at practical TTS deployment: small footprint, low latency, and easy local setup for demos, lightweight services, and product integration.
GitHub:
Huggingface:
Online demo:
Would love to hear feedback on quality, latency, and what use cases you’d want to try with a tiny open TTS model.
I've had aerosinusitis a few times before in my life and it was fairly painful, but not something that happens often. Today on a flight I had an overwhelming bout of it, the pressure was genuinely unbearable, and I had no painkillers with me.
I was on a cheap flight, in the cheap seats so no Wifi.
I've been playing around with local LLMs on my laptop for a year or so, but it's always been pure novelty. It suddenly dawned on me that I could use Gemma 4 mid-air, and so I pulled out my laptop and asked for any way I could possibly reduce the pain.
The Toynbee Maneuver, which I had never in my life heard of, slowly but surely relieved the pressure. Within 10 mins I felt completely fine.
It may sound trivial, but without local AI I would have been in blinding pain for probably 90 mins – so it was a rare moment when new technology actually makes a palpable difference to your life.
Sharing this here because my wife didn't care and I felt if anyone would appreciate this small win it would be this community.
I used to spend way too much time trying to keep my notes clean across docs, PDFs, and random files… and it never really stayed organized anyway.
Recently tried just dumping everything into this repo: and letting it compile things into a wiki automatically.
Its core loop:
sources → compile → wiki → query → save → richer wiki
Now I barely organize anything myself, it just structures everything in a way that actually makes sense when I come back to it.
Give it a spin and let me know what you think:)
Qwen 3.5 35B A3B Uncensored HauhauCS (repaired) -> (now with KL + ReLU calibration)
Model available here:
Repair summary:
Extra information about how Qwen 3.5 35B got broken (and how I fixed it):
V1 Apple MLX version (thanks to ):
V2 Apple MLX version (final release):
History:
Hello everyone. A few days ago I released a fixed version in which two broken tensors that Alibaba shipped with the Qwen 3.5 35B A3B model were scaled back to normal: ssm_conv1d.weight in blocks 36-37, which had drifted because of the heavy complexity of the training process and a bug in the AdamW optimizer. That fixed the major context collapse and looping. But after more testing, I found that some other tensors (experts, attention projections) had a subtler problem. Their overall scale and saturation looked fine, but the shape of their weight distribution was drifting away from the peer group. C1 and C2 didn't catch this. C3 (KL divergence) did.
So I added two more criteria to the diagnostic pass:
- KL divergence - restores the distribution shape of tensors that drifted from their peer group without changing scale or saturation.
- ReLU asymmetry - detects mean drift that AdamW can accumulate over time (didn't fire on this model, but the probe is there for others).
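To show what a shape-only drift looks like, here is a toy Python sketch. It is not my diagnostic script: the histogram binning, the add-one smoothing, and the synthetic "tensors" are all assumptions made for the illustration. The drifted sample has the same mean and variance as the peer group, so a scale check passes, but its shape is uniform instead of Gaussian.

```python
import math
import random


def histogram(values, lo, hi, bins=32):
    """Normalized histogram with add-one smoothing so no bin has zero probability."""
    counts = [1.0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


rng = random.Random(0)
peer = [rng.gauss(0.0, 1.0) for _ in range(20000)]
same = [rng.gauss(0.0, 1.0) for _ in range(20000)]
half = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
drift = [rng.uniform(-half, half) for _ in range(20000)]

p_peer = histogram(peer, -4, 4)
kl_same = kl_divergence(histogram(same, -4, 4), p_peer)
kl_drift = kl_divergence(histogram(drift, -4, 4), p_peer)
print(kl_same, kl_drift)
```

A scale or saturation check sees nothing wrong with the drifted sample, while its KL divergence against the peer histogram stands out clearly.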
Results on this version:
| Metric | Before | After |
|---|---|---|
| KL divergence (average) | 0.1036 | 0.0297 |
| KL reduction | — | 71.3% |
| Repaired tensors (C2 + C3) | 2 | 11 |
What this means for you:
- The model was already stable after v1. Now it's tighter - fewer hidden distribution anomalies that could cause weird behavior on very long or complex tasks.
- No new problems introduced. The 489 healthy tensors were left untouched.
Upgraded system prompt that unlocks deep thinking (works great with this model):
Also you can use only one string in System Prompt. And add anything you want after it:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Quantization script available here:
Updated chat template: (with tool fixes from and disabled thinking)
Recommended Settings (LM Studio):
| Setting | Value |
|---|---|
| Temperature | 0.7 |
| Top K Sampling | 20 |
| Presence Penalty | 1.5 |
| Repeat Penalty | Disabled or 1.0 |
| Top P Sampling | 0.8 |
| Min P Sampling | 0 |
| Seed | 3407 |
Enjoy ^_^
Commercial use is banned without prior written permission from MiniMax.
And their definition of "commercial" is broad: it covers paid services, commercial APIs, and even deploying a fine-tuned version for profit. Military use is also explicitly prohibited, which is interesting.
So you can't use the model or any outputs for anything commercial!
I'm really starting to hate these "open weights, closed license" models...
Those are tiny robots fighting each other to survive.
Only one class of robots is driven by qwen3 coder generated code, and between matches that code gets regenerated, so it does improve match after match...
It's funny to set different parameters and watch it.
Code:
this is a follow up for
I'd guess given the comments I've reviewed Qwen 3.5 (and Gemma 4) are deemed among the best models published for public consumption.
the original models in hf are here:
unsloth contributed various quants
among the models I tried are, on my plain old haswell i7 cpu 32 gb dram, all Q4_K_M quants
unsloth/Qwen3.5-27B-GGUF 0.95 tokens / s
unsloth/Qwen3.5-35B-A3B-GGUF 4 tokens / s
barozp/Qwen-3.5-28B-A3B-REAP-GGUF 7.5 tokens / s
tokens / s degrades as context becomes larger e.g. when following up with prompts in the same context / thread. it could be from that 7.5 gradually down to 1 tok/s
What I used is the Qwen-3.5-28B-A3B-REAP-GGUF as that is 'small' enough to deliver a barely adequate throughput (7.5 t/s) on my hardware.
---
Initial impressions are that Qwen 3.5 tends to mention related concerns / references. And in llama.cpp, it does pretty verbose 'thinking' / planning steps before replying with the actual response.
The mentions of related stuff make it a good documenter. I actually tasked it to analyse the code of a shell script and prepare usage documentation for the script. It does it pretty well, in nicely formatted markdown.
Code proposals are good (and some just ok), but the most interesting thing, which I always try to get LLMs to do and which is probably 'difficult' for these small LLMs, is to *refactor* code.
I asked it to refactor a shell script, fixing some bugs and adapting it to some structural changes in the data (e.g. the JSON format of the data), quite a complex task I'd think for such a 'small' LLM. It burned through > 10k tokens in the 'thinking' phase, but eventually did reply with refactored code. I'd guess that this LLM is kind of 'careful': I've seen it iterating over the (same) issues with 'wait ...', considering the dependencies / issues. The resulting code is 'not the best refactoring'; I'd guess it tried to follow the requirements of my prompt closely.
Among the things I tried was a recursive proposal, i.e. refactor the JSON data structure, then refactor the shell script to handle the new data structure. It refactored the JSON data structure but missed updating the shell script to work with it. It took a second run, given the new data structure and the script, for the new structure to be handled.
In addition, if the prompt is 'too ambiguous', it can go in loops in the 'thinking' phase trying to resolve the ambiguity. When I see this in the 'thinking' phase, I tend to need to stop the inference and restructure my prompt so that it is more specific, and that helps to get to the solution.
For the last few days I've been trying different models and quants on my RTX 3090 in LM Studio, but every single one glitches on tool calling: an infinite loop that doesn't stop. But I really liked the model because it is really fast, like 80-110 tokens a second, and even at high context it still maintains very high speeds.
I had great success with tool calling in the Qwen3.5 MoE model, but the issue I had with Qwen models is that some kind of bug in Win11 and LM Studio makes prompt caching not work, so when the conversation hits 30-40k context, it is so slow at processing prompts that it just kills my will to work with it.
Gemma 4 is different: it is much better supported on the ollama cpp and the caching works flawlessly. I'm using flash attention + q4 quants; with this I can push it to literally the maximum 260k context on an RTX 3090, and the model performs just as well.
I finally found the one that works for me: the unsloth Q3_K_M quant, temperature 1 and top-k sampling 40. I have a custom system prompt that I'm using, which also might be helping.
I've been testing it with opencode for the last 6 hours and I just can't stop. It cannot fail: it explained the whole structure of opencode itself to me, and it is huge (the whole repo is 2.7GB, so many lines of code), and it has no issues traversing around, reading everything, and explaining how certain things work. I think I'm going to create my own version of opencode in the end.
It honestly feels like Claude Sonnet level of quality and never fails to do function calling. I think this might be the best model for agentic coding / tool calling / open claw or search engine.
I prefer it over Perplexity: in LM Studio, connected to a search engine via a plugin, it delivers much better results than Perplexity or Google.
As for VRAM consumption, it is heavy. It can probably work on 16GB, but not for tool calling or agents; you need 10-15k context just to start. My GPU has 24GB of VRAM, so it can run at full context with no issues on Q4_0 KV.
------------------------------- Quick update post -----------------------------------------------------------------
I've switched to llama.cpp now. Read this post, it has some very valuable info if you want to run Gemma 4 as efficiently as possible.
I'm running the IQ4_XS quant by unsloth now: full context size 260k, 94-102 tk/s, 20-21GB VRAM usage, q4 KV cache.
I’m pretty new to the local AI world. So far, I’ve just been running small models on my mobile workstation (12GB VRAM) to help with my research in Obsidian and managing my Paperless-ngx setup. It’s been cool, but I definitely hit a wall when trying to run anything bigger or more "intelligent", for my use case however not really necessary (I also pay for Claude Pro but usage limits have lately been horrendous, but that's another topic).
I just stumbled across a deal on an NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB). It’s not significantly discounted (around 10% off), but I think the price is not bad (around 9700 USD).
I know these cards are rare and usually meant for big labs, but I’m tempted because I want to run the really powerful models (like the new Gemma 4 or DeepSeek) at home and access them from all my devices without relying on subscriptions.
My questions for the experts:
- Is 96GB VRAM basically "endgame" for a single-user setup, or would I be better off with something cheaper?
- Do people use such stuff for what I want to use them (running powerful local LLMs) or rather for AI training or something else?
- Would I have to build a custom PC to use it? How do I go from a GPU to actually using it?
I don't want to miss a rare price opportunity, but I also don't want to buy a piece of hardware I’ll never fully utilize. What would you do?
I think so. And the ARC iFGX on many laptops is "good enough" for many use-cases.
I wrote code to for a work-project under GDPR; Worked well enough. 15.000 images compared overnight; Took about 7 hours.
Slow, but secure.
After iterating from v6 to v8.3, FlashLM v8.3 outperforms the Transformer baseline on TinyStories generation quality.
Both models trained under identical constraints:
- Hardware: 2 vCPU / 5GB RAM (free-tier cloud CPU)
- Time budget: 2 hours wall-clock
- Dataset: TinyStories (same tokenizer, vocab 4096)
- Training: from scratch, no pretraining, no distillation
The only variable is architecture.
Models Compared
| Model | Architecture | Params | Training Tokens | PPL |
|---|---|---|---|---|
| v5.2 "Nova-Ignition" | Transformer + RoPE | 5.0M | full 574M (0.027 epochs) | 10.56 |
| v8.3 "CORTEX-VIII" | SWA + Gated Delta Memory | 6.5M | 10M subset (1.5 epochs) | 2.50 |
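Note: both runs processed roughly the same number of tokens in the 2h budget (0.027 × 574M ≈ 15.5M for v5.2; 1.5 × 10M = 15M for v8.3). v5.2 spent them on a single partial pass over the full dataset, while v8.3 made 1.5 passes over a 10M-token subset, so the training data regime differs along with the architecture.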
Generation Samples
Generation parameters: temperature=1.2 and max_tokens=100 for both models; sampling is top_k=40 for v5.2 and top_p=0.85 for v8.3.
Prompt: "Once upon a time"
| v5.2 (Transformer) | v8.3 (CORTEX) |
|---|---|
| Once upon a time on not pen cl nd grab wal . ily L , pl baby Sue dir , jump . aces park so luffy rec , igh made 's Lily star G began not gether ell G Tim ... | Once upon a time . sun like . helped look this !" began bed to . thought cake a and fish him Tom Mr Bunny fish . looked Ben place ! thinks book ?" butterfly the had and . |
Prompt: "The little girl"
| v5.2 (Transformer) | v8.3 (CORTEX) |
|---|---|
| `The little girl ame < | making c tak . nd ould One very His iled ay asked etter eating . ily too ay star j , help were ra se star re ook nicer r big poin .` |
Prompt: "One day a cat"
| v5.2 (Transformer) | v8.3 (CORTEX) |
|---|---|
| One day a cat B er fused . nd V rot his , en Spot re M mommy r c loud . day too ay came made ot ven . day ought un there , pl cry not gether ell cl special there wal er L , pl coffee , help not Dad after by ap mommy . | One day a cat . wanted and . laughed the but she . looked looked Tom the . lived in ! did do do , in said had ." girl her and tree pretty loved home school rest She She tea every . |
Observations
- v5.2 (Transformer) produces random word fragments. It never forms a complete sentence. This is expected: 5M params and 0.027 epochs simply isn't enough for a Transformer to learn syntax.
- v8.3 (CORTEX) shows clear syntactic structure. Subject-verb-object patterns appear (`helped talk`, `wanted go`, `laughed the but she`). Characters are named (`Tom`, `Tim`, `Mr Bunny`), actions are sequenced, and there's even a hint of emotion (`loved home school rest`).
- The repetition problem is largely solved. v8.1 used to output `Lily Lily Lily Lily` endlessly. v8.3 occasionally repeats (`play play`, `do do do`) but recovers and continues.
- PPL and generation quality are decoupled at this scale. v8.3's PPL (2.50) is worse than v7.4's (2.33), yet v8.3 generates much better text. Multiple epochs matter more than pure PPL for tiny models.
What Changed from v8.1 to v8.3?
- Subset training: 10M tokens instead of full 574M → 1.5 epochs in 2h (v8.1 only saw 0.027 epochs).
- Entropy regularization in loss (weight=0.01): prevents peaked distributions.
- Zero weight decay on embedding/head: preserves low-frequency token distinctions.
- SWA window reduced to 32, FFN kept at 512: better throughput, same expressiveness.
- Lookahead value heads down-weighted: they didn't help generation.
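To make the entropy regularization item concrete, here is a minimal pure-Python sketch. The post only gives the weight (0.01); the exact form used in FlashLM is not stated, so the formula below (cross-entropy minus weight times the entropy of the predicted distribution) and the function names are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_regularized_loss(logits, target, weight=0.01):
    """Assumed form: cross-entropy minus weight * entropy of the prediction.

    Subtracting the entropy rewards flatter distributions, which is one
    common way to discourage overly peaked outputs.
    """
    probs = softmax(logits)
    ce = -math.log(probs[target])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return ce - weight * entropy, ce, entropy

# A peaked prediction versus a flat one, both with target token 0.
peaked = entropy_regularized_loss([10.0, 0.0, 0.0, 0.0], target=0)
flat = entropy_regularized_loss([0.0, 0.0, 0.0, 0.0], target=0)
print("peaked: loss=%.4f ce=%.4f entropy=%.4f" % peaked)
print("flat:   loss=%.4f ce=%.4f entropy=%.4f" % flat)
```

With a weight this small the cross-entropy term still dominates; the regularizer only nudges the model away from collapsing onto a single token.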
Limitations (Honest)
- Still not fluent. Sentences are broken, grammar is shaky. 6.5M parameters is below the "syntax threshold" for English (~10-20M).
- TinyStories only. This isn't a general-purpose LLM.
- v5.2 is 5M, v8.3 is 6.5M. The quality gap is too large to be explained by 1.5M extra params, but I'll be testing a 5M CORTEX variant to make the comparison perfectly matched.
Why This Matters
FlashLM's goal isn't to beat Llama-3. It's to find the highest possible intelligence density under extreme constraints.
CORTEX-VIII combines:
- Sliding Window Attention (local, O(T))
- Gated Delta Memory (global, linear recurrence)
- Ternary-friendly design (though this run used float32 for speed)
At 6.5M params and 2h CPU training, a linear-complexity architecture is already beating a Transformer on generation quality. That's a small but real data point for the "efficient architecture" camp.
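For readers unfamiliar with why sliding window attention is linear in sequence length, here is a small sketch of the attention mask. It assumes a causal window that includes the current token; whether CORTEX-VIII uses exactly this convention is not stated in the post, and the tiny sizes are chosen only so the output is readable (the post uses a window of 32).

```python
def sliding_window_mask(seq_len: int, window: int):
    """mask[q][k] is True when query q may attend to key k.

    Assumed convention: causal, and each query sees at most the last
    `window` positions including itself.
    """
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Each row has at most `window` allowed positions, so total work grows as
# seq_len * window rather than seq_len ** 2.
print(sum(sum(row) for row in mask), "attended pairs")
```

Anything outside the window has to reach the current token some other way, which is the role the post assigns to the gated delta memory.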
Code & Weights:
- GitHub:
- v5.2 weights:
- v8.3 weights:
Questions welcome — happy to share training logs, hyperparameter sweeps, or failed experiments. The v6→v7 graveyard is especially educational.
I wanted a portable 13-inch laptop that can be a little LLM monster when needed. Asus did an amazing job with their new 2026 PX13 laptop, powered by a Strix Halo APU with 128GB of unified memory.
I made a benchmark automation system for the amazing toolboxes repo here:
This repo gives you multiple ready-to-use llama.cpp builds with ROCm and Vulkan.
My script sets the power profile (power saver or performance), then uses llama-bench to benchmark every provided GGUF with 3 different llama.cpp backends (Vulkan RADV / ROCm nightly / Vulkan AMDVLK).
The overall benchmark covers 25 models (from 4B to 120B) across all backends and power profiles. It took almost 12 hours, averaging 4 to 5 minutes per model per configuration.
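The run matrix described above can be sketched roughly as follows. This is not the actual script: the toolbox names, power profiles and token counts are taken from the run configuration listed in this post, while the use of `powerprofilesctl` and `toolbox run -c` to switch profiles and enter each container is an assumption, and the model file names are placeholders.

```python
from itertools import product

TOOLBOXES = ["llama-rocm7-nightlies", "llama-vulkan-amdvlk", "llama-vulkan-radv"]
POWER_PROFILES = ["performance", "power-saver"]
PROMPT_TOKENS = "1024,4096,8192,16384"
GEN_TOKENS = "512,2048"

def build_runs(models, repetitions=1):
    """Return (set_profile_cmd, bench_cmd) pairs for every combination."""
    runs = []
    for profile, toolbox, model in product(POWER_PROFILES, TOOLBOXES, models):
        # Assumption: power-profiles-daemon is what switches the profile.
        set_profile = ["powerprofilesctl", "set", profile]
        # Assumption: llama-bench is on PATH inside each toolbox container.
        bench = ["toolbox", "run", "-c", toolbox, "llama-bench",
                 "-m", model, "-p", PROMPT_TOKENS, "-n", GEN_TOKENS,
                 "-r", str(repetitions)]
        runs.append((set_profile, bench))
    return runs

models = [f"model-{i}.gguf" for i in range(25)]  # placeholder file names
runs = build_runs(models)
print(len(runs), "runs")
print(f"{len(runs) * 4 / 60:.1f} to {len(runs) * 5 / 60:.1f} hours "
      "at 4 to 5 minutes per run")
```

Each pair could then be executed with `subprocess.run(cmd, check=True)`. The count of 150 runs at 4 to 5 minutes each is consistent with the roughly 12 hours reported.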
Side note: I tested multiple "heretic/hauhau" versions of the mainstream models because I found they are much more efficient in their thinking process, and I saw a little increase in their coding performance compared to the originals (with some drop in translation tasks).
Here is the visualized leaderboard
For the power-saver profile I saw consumption near 40 W; for performance it varies from 60 to 77 W.
------------
llama-bench ProArt PX13 HN7306EAC with strix halo toolboxes
- Machine model: ProArt PX13 HN7306EAC
- CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
- Architecture: x86_64
- Kernel: 7.0.0-rc7-2-cachyos-rc
- OS: CachyOS n/a
- OS Version: n/a
- Toolboxes: ['llama-rocm7-nightlies', 'llama-vulkan-amdvlk', 'llama-vulkan-radv']
- Mode: medium
- Power Profiles: ['performance', 'power-saver']
- Prompt tokens: 1024,4096,8192,16384
- Generation tokens: 512,2048
- Repetitions: 1
Leaderboard (sorted by Token Generation/Second)
| Rank | Model | Best Gen Backend | Power Profile | Prompt/Gen Tokens (Gen) | Best Gen TPS | Best Prompt Backend | Prompt/Gen Tokens (Prompt) | Best Prompt TPS |
|---|---|---|---|---|---|---|---|---|
| 1 | Marco-Nano-Instruct.Q8_0.gguf | llama-vulkan-radv | Performance | 512 | 211.325 | llama-vulkan-radv | 1024 | 4296.133 |
| 2 | Marco-Mini-Instruct.Q8_0.gguf | llama-vulkan-radv | Performance | 512 | 165.874 | llama-vulkan-radv | 1024 | 2329.999 |
| 3 | OpenAI-20B-NEO-CODEPlus-Uncensored-IQ4_NL.gguf | llama-vulkan-radv | Performance | 512 | 86.033 | llama-rocm7-nightlies | 1024 | 1347.876 |
| 4 | gpt-oss-20b-Derestricted-MXFP4_MOE.gguf | llama-vulkan-radv | Performance | 512 | 74.471 | llama-rocm7-nightlies | 1024 | 1317.919 |
| 5 | gpt-oss-20b-heretic.MXFP4_MOE.gguf | llama-vulkan-radv | Performance | 512 | 74.356 | llama-vulkan-radv | 1024 | 1323.742 |
| 6 | Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | llama-vulkan-amdvlk | Performance | 512 | 69.059 | llama-vulkan-radv | 1024 | 917.500 |
| 7 | Qwen3.5-35B-A3B-heretic.Q4_K_M.gguf | llama-vulkan-amdvlk | Performance | 512 | 69.001 | llama-vulkan-radv | 1024 | 928.552 |
| 8 | LFM2-24B-A2B-Q8_0.gguf | llama-vulkan-amdvlk | Power Saver | 512 | 60.739 | llama-rocm7-nightlies | 1024 | 1456.713 |
| 9 | Qwen3.5-35B-A3B-Q4_K_M.gguf | llama-vulkan-amdvlk | Power Saver | 512 | 59.614 | llama-rocm7-nightlies | 1024 | 911.428 |
| 10 | Qwen3.5-4B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | llama-vulkan-radv | Performance | 512 | 59.263 | llama-vulkan-radv | 1024 | 1716.063 |
| 11 | Qwen3.5-4B-UD-Q4_K_XL-unsloth-v2.gguf | llama-vulkan-radv | Performance | 512 | 56.642 | llama-vulkan-radv | 4096 | 1600.179 |
| 12 | gemma-4-26B-A4B-it-UD-Q3_K_M.gguf | llama-vulkan-radv | Performance | 512 | 55.191 | llama-rocm7-nightlies | 1024 | 1044.901 |
| 13 | gemma-4-26B-A4B-it-UD-IQ4_XS.gguf | llama-vulkan-radv | Performance | 512 | 52.416 | llama-rocm7-nightlies | 1024 | 1510.919 |
| 14 | bartwoski_Qwen3.5-35B-A3B-Q4_K_M.gguf | llama-vulkan-amdvlk | Power Saver | 512 | 51.307 | llama-rocm7-nightlies | 1024 | 783.849 |
| 15 | gemma-4-26B-A4B-it-UD-Q4_K_XL (1).gguf | llama-vulkan-radv | Performance | 512 | 49.469 | llama-rocm7-nightlies | 1024 | 1620.560 |
| 16 | Qwen3-Coder-Next-UD-IQ1_M.gguf | llama-vulkan-radv | Power Saver | 512 | 48.834 | llama-vulkan-radv | 1024 | 472.070 |
| 17 | Qwen3.5-35B-A3B-UD-Q4_K_XL-unsloth-v2.gguf | llama-vulkan-amdvlk | Power Saver | 512 | 46.992 | llama-rocm7-nightlies | 1024 | 1009.841 |
| 18 | bartwoski_Qwen3-Coder-Next-IQ4_XS.gguf | llama-vulkan-radv | Power Saver | 512 | 41.375 | llama-vulkan-radv | 1024 | 615.839 |
| 19 | kldzj_gpt-oss-120b-heretic-v2-MXFP4_MOE-00001-of-00002.gguf | llama-rocm7-nightlies | Power Saver | 512 | 40.004 | llama-vulkan-radv | 1024 | 432.180 |
| 20 | Qwen_Qwen3-Coder-Next-IQ4_XS.gguf | llama-vulkan-radv | Power Saver | 0/2048 | 39.801 | llama-vulkan-radv | 1024 | 621.813 |
| 21 | Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | llama-vulkan-radv | Performance | 512 | 36.393 | llama-rocm7-nightlies | 1024 | 953.875 |
| 22 | Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive-IQ3_XXS.gguf | llama-vulkan-radv | Power Saver | 512 | 27.562 | llama-rocm7-nightlies | 1024 | 186.736 |
| 23 | omnicoder-2-9b-q8_0.gguf | llama-vulkan-radv | Performance | 512 | 23.944 | llama-rocm7-nightlies | 1024 | 986.071 |
| 24 | bartwoski_Qwen3.5-122B-A10B-IQ3_XXS-00001-of-00002.gguf | llama-vulkan-radv | Power Saver | 512 | 23.206 | llama-rocm7-nightlies | 1024 | 234.785 |
| 25 | unsloth-Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf | llama-vulkan-radv | Power Saver | 512 | 20.771 | llama-rocm7-nightlies | 1024 | 194.398 |
Leaderboard (sorted by Prompt Processing T/Second)
| Rank | Model | Best Gen Backend | Power Profile | Prompt/Gen Tokens (Gen) | Best Gen TPS | Best Prompt Backend | Prompt/Gen Tokens (Prompt) | Best Prompt TPS |
|---|---|---|---|---|---|---|---|---|
| 1 | Marco-Nano-Instruct.Q8_0.gguf | llama-vulkan-radv | Performance | 512 | 211.325 | llama-vulkan-radv | 1024 | 4296.133 |
| 2 | Marco-Mini-Instruct.Q8_0.gguf | llama-vulkan-radv | Performance | 512 | 165.874 | llama-vulkan-radv | 1024 | 2329.999 |
| 3 | Qwen3.5-4B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | llama-vulkan-radv | Performance | 512 | 59.263 | llama-vulkan-radv | 1024 | 1716.063 |
| 4 | gemma-4-26B-A4B-it-UD-Q4_K_XL (1).gguf | llama-vulkan-radv | Performance | 512 | 49.469 | llama-rocm7-nightlies | 1024 | 1620.560 |
| 5 | Qwen3.5-4B-UD-Q4_K_XL-unsloth-v2.gguf | llama-vulkan-radv | Performance | 512 | 56.642 | llama-vulkan-radv | 4096 | 1600.179 |
| 6 | gemma-4-26B-A4B-it-UD-IQ4_XS.gguf | llama-vulkan-radv | Performance | 512 | 52.416 | llama-rocm7-nightlies | 1024 | 1510.919 |
| 7 | LFM2-24B-A2B-Q8_0.gguf | llama-vulkan-amdvlk | Power Saver | 512 | 60.739 | llama-rocm7-nightlies | 1024 | 1456.713 |
| 8 | OpenAI-20B-NEO-CODEPlus-Uncensored-IQ4_NL.gguf | llama-vulkan-radv | Performance | 512 | 86.033 | llama-rocm7-nightlies | 1024 | 1347.876 |
| 9 | gpt-oss-20b-heretic.MXFP4_MOE.gguf | llama-vulkan-radv | Performance | 512 | 74.356 | llama-vulkan-radv | 1024 | 1323.742 |
| 10 | gpt-oss-20b-Derestricted-MXFP4_MOE.gguf | llama-vulkan-radv | Performance | 512 | 74.471 | llama-rocm7-nightlies | 1024 | 1317.919 |
| 11 | gemma-4-26B-A4B-it-UD-Q3_K_M.gguf | llama-vulkan-radv | Performance | 512 | 55.191 | llama-rocm7-nightlies | 1024 | 1044.901 |
| 12 | Qwen3.5-35B-A3B-UD-Q4_K_XL-unsloth-v2.gguf | llama-vulkan-amdvlk | Power Saver | 512 | 46.992 | llama-rocm7-nightlies | 1024 | 1009.841 |
| 13 | omnicoder-2-9b-q8_0.gguf | llama-vulkan-radv | Performance | 512 | 23.944 | llama-rocm7-nightlies | 1024 | 986.071 |
| 14 | Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | llama-vulkan-radv | Performance | 512 | 36.393 | llama-rocm7-nightlies | 1024 | 953.875 |
| 15 | Qwen3.5-35B-A3B-heretic.Q4_K_M.gguf | llama-vulkan-amdvlk | Performance | 512 | 69.001 | llama-vulkan-radv | 1024 | 928.552 |
| 16 | Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | llama-vulkan-amdvlk | Performance | 512 | 69.059 | llama-vulkan-radv | 1024 | 917.500 |
| 17 | Qwen3.5-35B-A3B-Q4_K_M.gguf | llama-vulkan-amdvlk | Power Saver | 512 | 59.614 | llama-rocm7-nightlies | 1024 | 911.428 |
| 18 | bartwoski_Qwen3.5-35B-A3B-Q4_K_M.gguf | llama-vulkan-amdvlk | Power Saver | 512 | 51.307 | llama-rocm7-nightlies | 1024 | 783.849 |
| 19 | Qwen_Qwen3-Coder-Next-IQ4_XS.gguf | llama-vulkan-radv | Power Saver | 0/2048 | 39.801 | llama-vulkan-radv | 1024 | 621.813 |
| 20 | bartwoski_Qwen3-Coder-Next-IQ4_XS.gguf | llama-vulkan-radv | Power Saver | 512 | 41.375 | llama-vulkan-radv | 1024 | 615.839 |
| 21 | Qwen3-Coder-Next-UD-IQ1_M.gguf | llama-vulkan-radv | Power Saver | 512 | 48.834 | llama-vulkan-radv | 1024 | 472.070 |
| 22 | kldzj_gpt-oss-120b-heretic-v2-MXFP4_MOE-00001-of-00002.gguf | llama-rocm7-nightlies | Power Saver | 512 | 40.004 | llama-vulkan-radv | 1024 | 432.180 |
| 23 | bartwoski_Qwen3.5-122B-A10B-IQ3_XXS-00001-of-00002.gguf | llama-vulkan-radv | Power Saver | 512 | 23.206 | llama-rocm7-nightlies | 1024 | 234.785 |
| 24 | unsloth-Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf | llama-vulkan-radv | Power Saver | 512 | 20.771 | llama-rocm7-nightlies | 1024 | 194.398 |
| 25 | Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive-IQ3_XXS.gguf | llama-vulkan-radv | Power Saver | 512 | 27.562 | llama-rocm7-nightlies | 1024 | 186.736 |
Here are more detailed tables with the exact context length for each run
I made an app, A.I.R.I, that runs LLMs locally on your phone. I’ve made a pretty big upgrade since its initial release and it’s starting to feel like something more than just a chat app.
The main idea now is: your phone = a personal AI server
It can:
- run models locally
- be accessed by other devices on your Wi-Fi
- support voice conversations (TTS + STT)
- handle documents with a simple RAG pipeline
- manage and download models inside the app
- keep chat history + user profiles for context

I also completely refactored the architecture so it’s modular and easier to extend (which was badly needed).
Still a work in progress, but this is the first time it feels like the original idea is actually working. Repo:
So this might very well be user error on my end but please let me know if whatever I am doing is somehow wrong:
- M4 Max (highest core count version), 64GB of unified memory
- Using oMLX 0.3.5dev1 version for serving, gemma 4bit it 26-a4b (200k context)
- Opencode harness for running the model - no custom instructions for now
Consistently I see the LLM not doing what it is told to do. Some examples:
- It doesn't think all the time. I have it set to the "high" variant in opencode, which sets the thinkingBudget to 8092 tokens, and have "forced" thinking within oMLX via the chat template and thinking budget, but it does not always think. For some reason it also stops after saying it will make a certain tool call, without making it. I don't know if this is a result of the qwen reasoning parser that I'm using. If anyone is using oMLX, let me know what reasoning_parser you are using.
- Another random question: I'm seeing a lot of people run this on the same hardware with much higher token generation speeds, but they are using less context (I'm using 200k). Is that the reason, or am I doing something else wrong here?
- It goes into repetition loops. I am using the default repetition penalty, but sometimes it's just bad (this was with oMLX v0.3.3, so maybe this has been patched since). Screenshot for this also attached:
So this has been my experience. Let me know if I'm doing anything obviously wrong, or whether I simply have to tone down my expectations. I know I can't have SOTA-like expectations for a model of this size, but I don't know if I'm miscalibrated or not. Because of all the hype around this Gemma 4 release, I thought it would be able to call tools reliably compared with my experience with some older models (GPT-OSS 20B / Qwen 3 Next / Qwen 3 Coder; GPT-OSS 20B used to say "I'll call the tool" and then just stop, the Qwen models were better).
So I'm not sure whether this is a calibration problem, a missing system prompt that works well with this model on opencode, or some wrong settings.
They have become completely unusable over the past few days.
A few things I have noticed:
- Codex has cut its 5-hour session cap massively so now you can barely tell it to program fizz buzz before running out of tokens.
- Claude Code has the same problem.
They have both just massively dropped in intelligence as well. I have heard people on X talking about how Anthropic models are being throttled in terms of intelligence (for non API tokens). I have had the same problem with GPT-5.4 where it just refuses to do stuff and has a bias to not take actions even if explicitly stated (which I've heard is a byproduct of limiting reasoning tokens).
This causes people to have to send more messages which then uses even more input & output tokens.
Might take the open-source pill. Perhaps Qwen3.5 27B locally, and GLM5.1 in the cloud.
What techniques help to reap the benefits of AI code without it accumulating into massive technical debt requiring costly re-writes?
I'm thinking about vibecoding the next part of my project, but I will almost certainly need someone with a lot more experience, a person or a company, who can help me figure everything out so I can learn this fast, or who can assist me directly.
The scope and amount of code is relatively small and not complicated; it's mostly small snippets. (I think I can provide the precise architecture.)
Does anyone have any idea where I can find assistance?
With the merging of , all known Gemma 4 issues in llama.cpp have been fixed. I've been running Gemma 4 31B on Q5 quants for some time now with no issues.
Runtime hints:
- remember to run with `--chat-template-file` with the interleaved template Aldehir has prepared (it's in the llama.cpp code under models/templates)
- I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
- running KV cache with Q5 K and Q4 V has shown no large performance degradation, of course YMMV
Have fun :)
(oh yeah, important remark - when I talk about llama.cpp here, I mean the *source code*, not the releases which lag behind - this refers to the code built from current master)
Important note about building: DO NOT currently use CUDA 13.2 as it is CONFIRMED BROKEN (the NVidia people are on the case already) and will generate builds that will not work correctly.
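The runtime hints above could be assembled into a single launch command along these lines. This is only a sketch: the model and template file names are placeholders (the real template is whichever interleaved Gemma 4 file sits under models/templates), and mapping "Q5 K and Q4 V" to the `q5_0` and `q4_0` cache types is an assumption.

```python
import shlex

def build_llama_server_cmd(model_path, template_path):
    """Assemble a llama-server command line from the runtime hints."""
    return [
        "llama-server",
        "-m", model_path,
        "--jinja",
        "--chat-template-file", template_path,
        "--cache-ram", "2048",
        "-ctxcp", "2",
        # Assumption: "Q5 K / Q4 V" means q5_0 and q4_0; q5_1 and q4_1
        # are the other plausible readings.
        "--cache-type-k", "q5_0",
        "--cache-type-v", "q4_0",
    ]

cmd = build_llama_server_cmd(
    "gemma-4-31B-it-Q5_K_M.gguf",                     # placeholder model file
    "models/templates/<interleaved-template>.jinja",  # placeholder name
)
print(shlex.join(cmd))
```

Note that a quantized V cache generally needs flash attention enabled, so check the startup log if the server rejects the cache type.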
The question they pose: can AI act like a computer? The team trained a video model to generate simulations of a terminal and a desktop and got decent results. Check more details:
paper :
I put together a small educational repo that implements distributed training parallelism from scratch in PyTorch:
Instead of using high-level abstractions, the code writes the forward/backward logic and collectives explicitly so you can see the algorithm directly.
The model is intentionally just repeated 2-matmul MLP blocks on a synthetic task, so the communication patterns are the main thing being studied.
Built this mainly for people who want to map the math of distributed training to runnable code without digging through a large framework.
Based on
by :
We just updated them again in response to:
- kv-cache : support attention rotation for heterogeneous iSWA
- CUDA: check for buffer overlap before fusing - CRITICAL fixes `<unused24>` tokens
- vocab : add byte token handling to BPE detokenizer for Gemma4
- convert : set "add bos" == True for Gemma 4
- common : add gemma 4 specialized parser
- llama-model: read final_logit_softcapping for Gemma 4
- llama: add custom newline split for Gemma 4
I have a modest rig that allows me to run Qwen 3.5 27B or even 35B via Ollama. Qwen has been amazing to work with and I've been fine with the slow drip trade-off.
Then Google released Gemma4.
It's fast, like 4B or 9B fast. Accuracy- and confidence-wise, it reminds me of that first release of Gemini Pro that could actually produce code that would run.
As a "local guy", this shift in usability and confidence for a small self-hosted LLM reminded me of what DeepSeek brought to the table years ago with the thinking capability.
Give it a go when you have a chance, and apply the settings that google recommends, it does make a difference (slightly slower but better)
I tried a few releases and this one worked the best for all the tests I threw at it with law interpretation, python, brainstorming & problem solving.
bjoernb/gemma4-26b-fast:latest (not affiliated with whoever made this)
in the next few days I'll start checking the abliterated versions to see how they stand with pentest & sysec tasks vs Qwen
Just finished quantizing MiniMax-M2.7 to GGUF. All standard quant levels available:
- BF16 (~427 GB)
- Q8_0 (~243 GB)
- Q6_K (~188 GB)
- Q5_K_M (~162 GB)
- Q4_K_M (~138 GB)
- Q3_K_M (~109 GB)
- Q2_K (~83 GB)
After much much much testing of various models for: Openclaw, Hermes, Claude Code, and 'random creative requests' - here is my currently working setup.
For Claude Code/Openclaw.
- I use AIRun to override Claude's model to Ollama, using GLM 5.1:cloud. I find this to be the best. Openclaw defaults to the same. It's a bit slow, but way more reliable than Minimax; I find Minimax is way more likely to be a cowboy and do stuff you didn't ask or want it to do.
- Local big model: Gemma4-26B-q4. This thing is amazing. Performance through the roof locally on an M4 Max, and it doesn't use up a zillion tokens on reasoning like Qwen does. Great for coding and reasoning locally. This is my local workhorse now.
- Creative tasks (joke of the day, basic writing stuff): llama 3.2 3B. Tiny, fast as f*** and does a great job at basic stuff. I find it to be the most creative and human of the models I've tested for creative writing.
I tried Qwen over and over but just had tons of issues, especially with too much reasoning (couldn't tweak it to low or medium) and just general performance.
Interested to hear your experiences.
Hey ,
I am releasing my first model quantization: an 8-bit symmetric AWQ (W8A16) of , specifically optimized for Ampere GPUs (RTX 30-series) using vLLM with the Marlin kernel on a single-GPU inference setup.
kai-os/Carnice-9b is a specialized fine-tune of Qwen/Qwen3.5-9B that removes the visual components and adopts the Qwen3_5ForCausalLM architecture for pure text/agentic use (Hermes Agent harness). This architecture is not yet natively supported by vLLM (pending PR #39316).
To enable seamless loading, the quantized checkpoint re-wraps the weights into the Qwen3_5ForConditionalGeneration architecture (matching the original Qwen/Qwen3.5-9B configuration). This allows vLLM to serve it correctly with the --language-model-only flag for text-only inference.
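For readers new to the "W8A16 symmetric" terminology: weights are stored as 8-bit integers with a zero-point fixed at 0, while activations stay in 16-bit. Below is a minimal sketch of symmetric per-group weight quantization. It is not the AWQ algorithm itself (AWQ additionally rescales channels using activation statistics before quantizing), and the group size and function names are made up for illustration.

```python
def quantize_symmetric_int8(weights, group_size=4):
    """Symmetric per-group quantization: one scale per group, zero-point 0."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-127, min(127, round(w / scale))) for w in group)
    return quantized, scales

def dequantize(quantized, scales, group_size=4):
    """Recover approximate float weights from int8 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]

weights = [0.5, -1.27, 0.01, 0.9, 2.54, -0.3, 0.0, 1.0]
q, s = quantize_symmetric_int8(weights)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"max abs error: {max_err:.4f}")
```

The reconstruction error per weight is bounded by half a quantization step of its group, which is why smaller groups (more scales) trade memory for accuracy.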
Model:
Benchmark highlights (vLLM bench on random dataset, single RTX 3090 + Marlin):
• Average prompt throughput: ~1,994 tokens/s
• Average generation throughput: ~222 tokens/s
I'm gonna run some benchmarks specific to the Hermes agent environment (Terminal Bench Lite and YC bench). From a quick vibecheck it seems pretty good
Quick vLLM usage (single GPU):
    vllm serve TurbulenceDeterministe/Carnice-9b-W8A16-AWQ \
      --max-model-len auto \
      --reasoning-parser qwen3 \
      --language-model-only \
      --tensor-parallel-size 1
I would greatly appreciate your feedback on how to improve future quantizations. Thank you!
Found this tweet online and wanted to see if anyone here had any opinions on it.
I'm an AI Researcher and have been exploring Latent Space Reasoning for a bit (since mid-2024; I really got into it when Meta published Coconut). This would check out in a few ways:
- The performance mentioned here.
- The order-of-magnitude reduction when comparing Mythos and Opus 4.6 for BrowseComp.
- General discussions from researchers in the space.
I've personally done some research into it, and I think it will be the future of AI and reasoning models. Too many reasons for it not to be (especially if we create a unified reasoning plane that models can plug in and out of). Wanted to get your thoughts on it, especially if anyone else has tried it.
Did a bunch of experiments on it here, in case anyone is interested (would love to hear your experiences with it as well):
Hi everyone,
I am new to this community, this is my first blog post here (forgive if there are any mistakes).
I recently came across this blog post on pytorch website, , my understanding of what this does (please correct me if I am wrong): It generates custom triton kernels for various attention implementations, (some kind of compiler for attention), this helps save memory and latency during the scaled dot product attention computation, as this heavy work can be smartly offloaded to the GPU.
I found it very interesting and would like to use it in one of my projects, for this I need to integrate this to an actual LLM (say LLama3/3.1/3.2), since this provides only the attention computation, how can I integrate it with weights of an actual LLM? Almost all the tutorials I saw for flex attention generate random Q, K and V matrices for demonstration.
There is also an option of using something like `attn_implementation=flex_attention`, but then how do I use the `score_mod` and `mask_mod` attributes?
Is there some documentation, or a git repo doing this? Any guidance on how to approach this would help.
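As a reference point for the `score_mod` / `mask_mod` question, here is a pure-Python sketch of what the two callbacks mean. It mirrors the argument order FlexAttention documents (`score_mod(score, b, h, q_idx, kv_idx)` returns an adjusted score; `mask_mod(b, h, q_idx, kv_idx)` returns True where attention is allowed), but it is a naive single-query loop for illustration, not the PyTorch API or a way to load LLM weights. The helper `attention_row` is hypothetical.

```python
import math

def attention_row(q, keys, values, q_idx, score_mod=None, mask_mod=None, b=0, h=0):
    """Naive attention for one query, showing where the two hooks apply."""
    scores = []
    for kv_idx, k in enumerate(keys):
        score = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
        if score_mod is not None:
            score = score_mod(score, b, h, q_idx, kv_idx)
        if mask_mod is not None and not mask_mod(b, h, q_idx, kv_idx):
            score = -math.inf  # masked positions get zero weight
        scores.append(score)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def causal(b, h, q_idx, kv_idx):
    # mask_mod: a query may only look at itself and earlier positions
    return q_idx >= kv_idx

def distance_bias(score, b, h, q_idx, kv_idx):
    # score_mod: penalize distant keys, similar in spirit to ALiBi
    return score - 0.5 * abs(q_idx - kv_idx)

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out = attention_row([1.0, 0.0], keys, values, q_idx=0, mask_mod=causal)
print(out)
```

In a real model the Q, K and V inputs come from the attention layer's own projections of the hidden states, so the integration point is inside each attention module rather than a separate weight-loading step.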
I have been using the official api minimax-m2.7 and minimax-m2.5 in claude code since the first day of release and minimax-m2.5 always seems to complete tasks and figure things out faster than 2.7.
Minimax-m2.7 hallucinates too much, and I haven't seen any improvement in real-world usage in literally any task, but I have noticed regressions.
In terms of reliability 2.5 > 2.7
I have no idea why this is the case when it performs better on all benchmarks...
I've been using a 5090 build as a hybrid PC (80% local LLM, 20% gaming). It is essentially a near-maxed out consumer setup (9950x3d, 128GB RAM).
I've recently decided to commit more to building some LLM workflows for my partner's local business (plus some other local colleagues) and have a new 6000 Pro Max-Q coming soon to expand to larger models w/ larger context (was able to get good business pricing + NVIDIA Inception discount).
I'm inclined to just add it to my current setup to upgrade the 'core' LLM portion of my usage. I'd keep the 5090 as a dev gpu for testing out new models and/or learning multi-model workflows, plus gaming. My only concern is that keeping the 5090 attached will handicap the 6000 by cutting the PCIE bandwidth of my mobo in half (x8/x8 vs x16).
I've also been tempted to just sell the 5090 and get another 6000, but that seems to overshadow the rest of the machine (would likely want 256GB RAM, plus same PCIE conundrum)
I do like the hybrid-ness of the current setup and potential of a 6000/5090 since it shares costs across multiple budgets (gaming, hobby/learning, business), but feels like I'm reaching a max point of those activities starting to interfere with each other.
Does anyone have a similar build and like it? Is this a dumb 'trying to do everything' machine that I should commit one way or another on? At what level does a machine have to move on from consumer components?
Are local 4B models usable on smartphone?
Just did a vibe check on a Pixel Pro 10, Gemma 4B vs Qwen 3.5 4B, starting from handheld photos of ninth grade STEM tests (written in French, I asked in English, and both models replied in English)
Gemma 4 E4B via Google AI core runs on NPU: quite fast, energy efficient, but hallucinated about half the text from the image and failed. When the tests were manually entered as text, it gets most of them right.
Qwen 3.5 4B Q4_K_M via PocketPal (llama.cpp under the hood) not only got all the text right, it also passed all the tests without errors. But the phone got very hot, and it would slow to a crawl after a couple hundred tokens (it regained speed when allowed to cool down, even on long context).
Interestingly enough, the Qwen model is slightly smaller (3.4GB vs 3.6GB), if it would get NPU support and basic tools, I suspect it could cover everyday AI needs locally...
I’ve got a project where I want to translate text between languages. Does anyone know what would be the best model to use for this task?
I was thinking of throwing the largest Qwen model I can fit in memory at it, since it would probably do the job, but idk if there are smaller or better purpose-built models for this, since it’s a well-defined task.
It will be happening offline, so speed/efficiency isn’t a factor, quality of output is the main consideration.