tl;dr;

I feel personally attacked

Like it says in the title. Specifically, the 26b MoE.

I’ve wanted to like this model, so much. Thought it might replace Qwen 3.5 27b. Keep coming back to it and trying it every time there’s an update, hoping it will have improved.

I’m running unsloth UD-Q4_K_XL on llama.cpp. I’m on the latest commits from main. I know about --jinja. I know about the interleaved thinking template. I’m not running low quant KV cache. This is far from the first model I’ve run.

Every time, my tests show the same thing - it is a very lazy model when it comes to using skills or searching the web. If you ask it a question, it will by default answer from its own knowledge without a single web search. If you explicitly ask it for a web search, it will lower itself to performing a _single_ web search, quickly scan the snippets from the search and then internally decide “with the snippets and my own internal knowledge I have enough information to answer, I don’t need to search more”.

This happens even if you:

- have given it tools for search and fetch, with the search tool including a description “don’t answer from these snippets, use fetch” and the fetch tool saying “use this to fetch pages obtained from the search tool”.

- have explicitly told it “search extensively”, “dig deep”, “don’t be lazy” etc.

- have put in context a pushy skill called “searching-the-web” with explicit instructions to do all the above.

- have put in context a pushy skill instruction saying “you must use skills if you think they have even a small chance of being applicable”.

- have explicitly told it “reference the searching-the-web skill”

Qwen 3.5, you barely have to ask and it will go on a whole quest to dig things up for you. Gemma 4, you scream at it till you’re blue in the face and it can barely be arsed to perform a single search. My only conclusion is that it just _really does not want to search the web_ (for AI values of “want” of course).

If I’m crazy, tell me. If you have it working great and digging deep on the web without having to twist its proverbial arm, tell me. And please be so kind as to tell me what quant / settings you’re running to make it capitulate on this point.


[deleted]

1 mo. ago

I feel personally attacked

Gemma 4 has a systemic attention failure. Here's the proof.

u/EvilEnginer

1 hr. ago

Gemma 4 has a systemic attention failure. Here's the proof.


I've spent months building a diagnostic method for large language models. It catches what standard benchmarks miss - distributional collapse inside tensors, not just loss or perplexity.

Gemma 4 26B A4B fails it.

I analyzed . Found 29 tensors with distribution drift. 21 of them are attention layers.

Full log:

29 tensors with KL(Kullback-Leibler)-drift.
21 of them are attention layers (attn_k, attn_q, attn_v).

Samples

Tensor	KL Before	KL After
blk.8.attn_k	0.2201	0.0006
blk.17.attn_q	0.1672	0.0001
blk.23.attn_q	0.1672	0.0001
blk.19.attn_k	0.0975	0.0001
blk.12.attn_k	0.0890	0.0006
blk.22.attn_k	0.0879	0.0004
blk.28.attn_k	0.0791	0.0007
blk.8.attn_q	0.0530	0.0002
blk.6.attn_k	0.0490	0.0001
blk.15.attn_q	0.0482	0.0003
blk.1.attn_k	0.0474	0.0006

Normal range: below 0.02. The samples above were roughly 2x to 11x over that.

Gemma 4 attention mechanism has systemic drift. The model was released broken.

u/PoemAccomplished2173

3 min. ago

Building a local legal drafting LLM — no dataset?

I’m working on a project to build a fully in-house legal drafting tool (NDAs, agreements, clauses, etc.), but I’m stuck on data.

I can’t find any solid open datasets for contracts/NDAs, and I also don’t have a corpus to use for RAG. Fine-tuning seems hard without data, and RAG needs documents I don’t have.

I did try fine-tuning Phi-3 using LoRA on synthetic data, but it starts hallucinating and doesn’t produce reliable outputs.

How do people usually approach this from scratch?

Where do you get usable legal docs/templates?

Is synthetic data (LLM-generated clauses, variations) actually viable?

Better to start with RAG or try fine-tuning anyway?

Would appreciate any real-world advice from folks who’ve built something similar.

Thanks.

kepler-452b. GGUF when?

u/the-grand-finale

5 days ago

kepler-452b. GGUF when?

Audio processing landed in llama-server with Gemma-4

u/srigi

16 hr. ago

Audio processing landed in llama-server with Gemma-4

Generation

Ladies and gentlemen, it is a great pleasure to confirm that llama.cpp (llama-server) now supports STT with the Gemma-4 E2B and E4B models.


Claude code source code has been leaked via a map file in their npm registry

u/Nunki08

13 days ago

Claude code source code has been leaked via a map file in their npm registry

We have a new weight class...

u/LegacyRemaster

16 min. ago

We have a new weight class...

I technically got an LLM running locally on a 1998 iMac G3 with 32 MB of RAM

u/maddiedreese

7 days ago

I technically got an LLM running locally on a 1998 iMac G3 with 32 MB of RAM

MiniMax-M2.7 vs Qwen3.5-122B-A10B for 96GB VRAM full offload?!

u/VoidAlchemy

10 hr. ago

MiniMax-M2.7 vs Qwen3.5-122B-A10B for 96GB VRAM full offload?!


tl;dr;

For 96GB VRAM full offload rigs, I'd probably choose Qwen3.5-122B-A10B over MiniMax-M2.7 today. Curious what y'all's experience is.

Quants Tested

ubergarm/MiniMax-M2.7-GGUF IQ2_KS 69.800 GiB (2.622 BPW)
ubergarm/Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW)
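As a quick sanity check on those two lines: bits per weight is file size in bits divided by weight count, so size and BPW together imply roughly how many weights each file holds. A minimal sketch, assuming "GiB" means 2^30 bytes and ignoring GGUF metadata overhead (the helper name is made up for illustration):

```python
GIB = 2**30  # assumption: the listed sizes are binary gibibytes


def implied_params_billion(size_gib: float, bpw: float) -> float:
    """Estimate the weight count (in billions) from file size and bits per weight."""
    total_bits = size_gib * GIB * 8
    return total_bits / bpw / 1e9


# (size in GiB, bits per weight) copied from the list above
quants = {
    "MiniMax-M2.7 IQ2_KS": (69.800, 2.622),
    "Qwen3.5-122B-A10B IQ5_KS": (77.341, 5.441),
}

for name, (size_gib, bpw) in quants.items():
    print(f"{name}: ~{implied_params_billion(size_gib, bpw):.0f}B weights")
```

That works out to roughly 229B weights for the MiniMax file and 122B for the Qwen file, so the comparison really is "about twice the weights at about half the bits each".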

Rambling Details

It's amazing that we now have multiple open-weights LLMs that work pretty well for local vibecoding! Both quants tested here work well enough with opencode configured to enable/disable thinking dynamically (which really speeds up generating a 5-word thread title lol).

Thanks to Wendell of level1techs I have access to a rig with 96GB VRAM for benchmarking and making GGUF quants. My daily driver has been Qwen3.5-122B fully offloaded on the 2x A6000 GPUs (kind of like a 3090 with 48GB VRAM each). Now with the new MiniMax-M2.7 quants, I had to decide whether a larger but more heavily quantized model would be better or not.

Like all complex questions, the answer is usually, "it depends"!

But at least for my purposes, it seems like Qwen3.5-122B-A10B is still on top for inference speed, code quality, and general quality of life.

Here is some data to back up this opinion:

humaneval benchmark

I vibe coded a quick EvalPlus python client and threw the 164 problem humaneval benchmark at both of the quants running on ik_llama.cpp llama-server.

Metric	MiniMax-M2.7 IQ2_KS	Qwen3.5-122B-A10B IQ5_KS
pass@1 (base)	0.220	0.494
pass@1 (base+extra)	0.220	0.482
Eval time	32:48	31:20

This was using temperature=1.0 and top_p=0.95 as suggested by MiniMax's model card. To be fair, this was a quick vibecoded client test harness, so maybe something is off. Not sure what the results should even look like haha... But Qwen3.5 got a higher score!
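To put those pass@1 numbers in concrete terms, here is a small sketch that converts them back into solved-problem counts. It assumes one sample per problem (the post doesn't say how many samples the client drew), in which case pass@1 is simply solved / total:

```python
TOTAL_PROBLEMS = 164  # HumanEval size, as stated in the post


def solved_count(pass_at_1: float, total: int = TOTAL_PROBLEMS) -> int:
    """Recover the number of solved problems, assuming one sample per problem."""
    return round(pass_at_1 * total)


# pass@1 (base) and pass@1 (base+extra) copied from the table above
scores = {
    "MiniMax-M2.7 IQ2_KS": (0.220, 0.220),
    "Qwen3.5-122B-A10B IQ5_KS": (0.494, 0.482),
}

for name, (base, extra) in scores.items():
    print(f"{name}: {solved_count(base)}/{TOTAL_PROBLEMS} base, "
          f"{solved_count(extra)}/{TOTAL_PROBLEMS} base+extra")
```

Under that assumption the gap is about 36 vs 81 problems on the base tests, which is large enough that it's worth double-checking the harness before reading too much into it.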

inference speed

I ran llama-sweep-bench on the same version of ik_llama.cpp, using a command similar to the llama-server one I used for evaluation, filling up most of the 96GB VRAM. While MiniMax-M2.7 could go out further, I got tired of waiting and hit control-c on the test. You get the point.

quality of life

MiniMax-M2.7 does support some self-speculative decoding, whereas Qwen3.5 does not (recurrent model). However, it needs a fairly heavily quantized kv-cache to fit even 160k of context.

Qwen3.5-122B runs with mmproj loaded for image processing and supports full 256k unquantized kv-cache which is just nice.

Conclusion

I'm hungry, it's dinner time.

u/TheAchraf99

4 min. ago

Early demo: autonomous red-teaming for vulnerable AI agents


Sharing an early prototype from December for autonomous red-teaming of vulnerable AI agents.

The idea was to move beyond static prompt libraries and build something that can:

- choose attack strategies
- keep memory of what worked
- route between specialized attack agents
- surface actual findings instead of just raw generations

The prototype targets classes like:

- prompt injection
- indirect injection
- tool abuse
- data exfiltration

This is still an old version, but it shows the core direction.

I’d love feedback from people here on a few things:

do you think multi-agent offensive testing is actually better than well-designed scripted evals?
what would you want to see logged or benchmarked to trust results from a system like this?
if you’re building agentic systems, what attack surface worries you most right now?

Not trying to shill, genuinely looking for serious feedback before we push the next version further.

11 days ago

Gemma 4 has been released

What’s new in Gemma 4

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key capability and architectural advancements:

Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models).
Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.
Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.

Models Overview

Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).
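The interleaving described above can be pictured with a small sketch. The 1-in-6 ratio and the window of 4 tokens are placeholder values chosen for illustration, not Gemma 4's actual configuration (the text quoted here doesn't give those numbers), and both function names are hypothetical:

```python
def layer_pattern(n_layers: int, global_every: int) -> list[str]:
    """Label each layer 'local' or 'global'; the last layer is forced global."""
    kinds = [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(n_layers)
    ]
    kinds[-1] = "global"  # the notes say the final layer is always global
    return kinds


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal mask: token i sees only the `window` most recent tokens, itself included."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    print(layer_pattern(n_layers=10, global_every=6))
    for row in sliding_window_mask(seq_len=6, window=4):
        print("".join("x" if allowed else "." for allowed in row))
```

The local layers only ever keep `window` keys and values per token, which is where the memory savings come from; the global layers are what let information travel across the whole context.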

Core Capabilities

Gemma 4 models handle a broad range of tasks across text, vision, and audio. Key capabilities include:

Thinking – Built-in reasoning mode that lets the model think step-by-step before answering.
Long Context – Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B).
Image Understanding – Object detection, Document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
Video Understanding – Analyze video by processing sequences of frames.
Interleaved Multimodal Input – Freely mix text and images in any order within a single prompt.
Function Calling – Native support for structured tool use, enabling agentic workflows.
Coding – Code generation, completion, and correction.
Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
Audio (E2B and E4B only) – Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.

GLM 5.1 sits alongside frontier models in my social reasoning benchmark

u/cjami

14 hr. ago

GLM 5.1 sits alongside frontier models in my social reasoning benchmark


the state of LocalLLama

u/Beginning-Window-115

3 days ago

the state of LocalLLama

MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers

u/Visual_Synthesizer

6 hr. ago

MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers


MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — 127.7 tok/s C=1, 2800 peak C=128

Ran a full sweep on Luke Alonso's M2.7 NVFP4 quant. Writing it down for anyone shopping the same setup.

**Hardware:** AsRock Rack B650D4U-2L2T, EPYC 4564P, 128GB DDR5 ECC, 2x RTX PRO 6000 Blackwell (96GB, 600W) behind a C-Payne PM50100 PLX Gen5 switch (PIX topology).

**Software:** SGLang via voipmonitor/sglang:cu130 docker (b12x 0.8.3), modelopt_fp4, bf16 KV, TP=2, Luke's default recipe.

**Decode throughput (ctx=0, 3x mean, 30s/cell):**

| C | agg tok/s | per-req tok/s |
|---|-----------|---------------|
| 1 | 127.7 | 127.7 |
| 8 | 471.6 | 59.0 |
| 32 | 1078.9 | 33.7 |
| 64 | 1695.4 | 26.5 |
| 128 | 2800.2 | 21.9 |
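For anyone reading the decode table: the per-request column is just aggregate throughput divided by concurrency. A short sketch that recomputes it and also shows how far total throughput scales past a single stream (the function names are made up for illustration):

```python
# (concurrency, aggregate tok/s) pairs copied from the decode table above
decode = [(1, 127.7), (8, 471.6), (32, 1078.9), (64, 1695.4), (128, 2800.2)]

single_stream = decode[0][1]


def per_request(concurrency: int, aggregate: float) -> float:
    """Average decode speed each request sees at a given concurrency."""
    return aggregate / concurrency


def scaling(aggregate: float, baseline: float = single_stream) -> float:
    """How many times more total throughput than the single-stream run."""
    return aggregate / baseline


for c, agg in decode:
    print(f"C={c:>3}: {per_request(c, agg):6.1f} tok/s per request, "
          f"{scaling(agg):5.1f}x the C=1 aggregate")
```

At C=128 that is about 22x the single-stream throughput, with each request running at roughly a sixth of the C=1 speed.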

**Prefill (C=1):**

| ctx | TTFT | tok/s |
|-----|------|-------|
| 8K | 0.50s | 17,286 |
| 16K | 0.99s | 16,926 |
| 32K | 2.09s | 15,861 |
| 64K | 4.94s | 13,319 |
| 128K | 13.25s | 9,908 |

No speculative decoding — there's no NEXTN drafter for M2.7 yet. When one ships expect a meaningful jump at low concurrency.

Long-context cells skip at high concurrency (KV pool is ~83K tokens on bf16-KV TP=2). 16K is fine up to about C=8 per-req before queue contention kicks in; 128K is C=1-only territory.

Full methodology and caveats:

Thanks to Luke for the kernels + quant, and to Jon for the recent calibration data update on the M2.7 NVFP4 weights.

7 days ago

What it took to launch Google DeepMind's Gemma 4

mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)

mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)

9 hr. ago

https://github.com/ggml-org/llama.cpp/pull/19441


Qwen wants you to know…

u/m-gethen

23 days ago

Qwen wants you to know…

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

u/PerceptionGrouchy187

20 hr. ago

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Following up on my previous Gemma 4 31B benchmark post, I tested speculative decoding with Gemma 4 E2B (4.65B) as the draft model.

The results were much better than I expected, so I wanted to share some controlled benchmark numbers.

Setup

GPU: RTX 5090 (32GB VRAM)
OS: Windows 11
Main model: Gemma 4 31B UD-Q4_K_XL (18.3GB)
Draft model: Gemma 4 E2B UD-Q4_K_XL (3.0GB)
Backend: llama.cpp fork with TurboQuant KV cache (turbo3)
Config: 128K context, parallel=1, Flash Attention, --draft-max 8 --draft-min 1

Benchmark Results

Same server config for both, max_tokens=500, temp=0.7, warm-up query discarded before measuring.

Query Type	Baseline (t/s)	SpecDec (t/s)	Accept Rate	Speedup
Math explanation	57.45	85.86	62.9%	+49.5%
Korean poetry	56.93	62.34	44.1%	+9.5%
Code generation	57.15	86.05	60.7%	+50.5%
Science explanation	57.19	71.14	50.9%	+24.4%
Translation + analysis	57.14	63.26	42.2%	+10.7%
Average	57.17	73.73	52.2%	+29.0%

Even at 42% acceptance rate, speculative decoding is still +10% faster because there's zero token translation overhead when the vocabs are compatible.

The GGUF Version Trap

I initially got terrible results — the draft model was slower than no draft at all (7.31 t/s vs 57 t/s baseline). Every draft model combo gave this warning:

the target and draft vocabs are not compatible - tokens will be translated between the two

After digging into speculative.cpp, I found the compatibility check compares add_bos_token between target and draft. My 31B GGUF was from early April when Gemma 4 first dropped, and it had add_bos_token = false. The E2B model (downloaded later) had add_bos_token = true. This single metadata mismatch forced llama.cpp into token translation mode, killing all performance gains.

Re-downloading the 31B GGUF (Unsloth re-quantized all Gemma 4 GGUFs recently with the fix) made the warning disappear and unlocked the full +29% speedup.

TL;DR: If you downloaded your Gemma 4 GGUF in early April 2026, re-download it. The tokenizer metadata was fixed.

Practical Tips

Add these flags to your existing llama-server command:

-md gemma-4-E2B-it-UD-Q4_K_XL.gguf
-ngld 99
--draft-max 8
--draft-min 1
--parallel 1

Things to watch out for:

--parallel 1 is mandatory — with auto (=4), the draft model's KV cache is allocated 4x, eating VRAM and tanking speed to 7 t/s
No vision — speculative decoding and multimodal can't be used together
Q4 draft is fine — Q8 (4.8GB) doesn't improve speed over Q4 (3.0GB), and Q4 leaves more VRAM headroom
Extra VRAM ~2.3GB — total ~23.4GB with 128K context on a 32GB card (256K fits too, ~25.5GB).

Content-dependent speedup

The gains scale with how predictable the output is:

Code / Math (structured, repetitive patterns): ~60% accept rate → +50% speed
Explanations (semi-structured): ~50% accept rate → +24%
Creative / Translation (less predictable): ~42% accept rate → +10%

Even the worst case is still a net positive, which is the key difference from having incompatible vocabs where even 65% acceptance rate resulted in zero gains.
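One rough way to see why acceptance rate drives the gains: under the idealized assumption that each drafted token is accepted independently with probability a, a draft of length n yields (1 - a^(n+1)) / (1 - a) tokens per target-model pass on average. This is the standard geometric-series estimate from the speculative decoding literature, not how llama.cpp computes its reported accept rate, and it ignores the draft model's own cost. A minimal sketch:

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass, assuming i.i.d. acceptance."""
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if accept_prob == 1.0:
        return float(draft_len + 1)  # every draft token plus the bonus token
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# Approximate accept rates from the three content categories above, draft-max 8
for label, rate in [("code / math", 0.60), ("explanations", 0.50), ("creative", 0.42)]:
    tokens = expected_tokens_per_pass(rate, draft_len=8)
    print(f"{label:>12}: ~{tokens:.2f} tokens per target pass")
```

The measured speedups are smaller than these ratios because every drafted token also costs draft-model time, but the ordering across content types matches.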

draft-max Sweep

Thanks to for the suggestion. Same benchmark setup, only varying --draft-max:

draft-max	Math	Poetry	Code	Science	Translation	Avg (t/s)	vs baseline
baseline	57.45	56.93	57.15	57.19	57.14	57.17	—
2	73.43	60.49	68.69	62.46	62.42	65.50	+14.6%
4	83.31	60.88	73.12	65.29	67.98	70.12	+22.6%
8	85.86	62.34	86.05	71.14	63.26	73.73	+29.0%
16	99.35	62.58	78.74	68.39	58.31	73.47	+28.5%

draft-max 8 is the sweet spot for mixed workloads. 16 pushes math to 99 t/s but regresses on code, science, and translation, ending up at about the same average. Creative text stays flat (~62 t/s) regardless of draft-max: the bottleneck there is acceptance rate, not draft length.

About TurboQuant

u/Exact_Law_6489

10 hr. ago

About TurboQuant


I know it's been a while, but I'm trying to understand: is TurboQuant really revolutionary, or is it just another mediocre technology that has been overhyped by Google and Twitter?

u/myui8443

4 min. ago

Hands-on: Building agent workflows with Gemma 4 locally (free Colab notebook)


With all the recent interest around Gemma 4 and local LLMs, I put together a small hands-on to explore building agent-style workflows locally.

Gemma 4 is getting surprisingly capable even in local setups, especially for reasoning and lightweight agent use cases.

I created a Colab notebook where you can try this end-to-end for free (no setup required):
Runs on local models (e.g. via Ollama) — no API costs.

The notebook walks through building a simple agentic workflow on top of a local model (Gemma 4), while keeping the control flow explicit and easy to reason about.

Under the hood, it uses a lightweight OSS workflow layer (similar to LangGraph, but focused on better developer experience), and works nicely alongside agent frameworks like ADK or PydanticAI for things like ReAct-style reasoning and tool use.

I recently open-sourced the framework under the hood (Apache 2.0 license):

Would love any feedback if you get a chance to try it.


u/butlan

8 hr. ago

Experiment: Olmo 3 7B Instruct Q1_0


https://huggingface.co/cturan/Olmo-3-7B-Instruct-Q1_0


u/Savantskie1

14 hr. ago

Is anyone else creating a basic assistant rather than a coding agent?


Hello everyone,

I’ve been thinking and perusing Reddit lately and noticed that most people are using LLMs for agentic coding and such. I’m not much of a coder myself but I do need to have a personal assistant. I’ve had 4 strokes since 2016, I’m disabled and more or less home bound. I can’t get out and make friends, or even hang out with the friends I do have due to living in a small town apartment nearly 150 miles away from everyone.

So my question is: is anyone else building, or has anyone built, a personal assistant using an LLM like I have? What does it do for you? How is it deployed? I’m genuinely curious. After spending nearly the last year and 2 months building my LLM’s memory system, I’m kinda curious what other people have built.

u/-dysangel-

15 hr. ago

Minimax 2.7 running sub-agents locally

AI MAX 395+ w/ 128 GB or dual 3090s?

u/Engineering_Acq

8 hr. ago

AI MAX 395+ w/ 128 GB or dual 3090s?

I like the idea of the 395+ with 128 GB VRAM, but the inference speed with bigger models makes it seem like it's not worth it. I feel like if you ever need the capabilities of a bigger model, you can just use a cloud LLM for that.

Whereas with dual 3090s, you get a decent-size model with lots of speed, which is far better for use cases such as agentic workflows.

What do you guys think?

u/FPham

11 hr. ago

"Actually wait" ... the current thinking SOTA open source


I'm trying GLM 5.1, but is it just me or does the thing really just work by over-cranking thinking to almost ridiculous heights?

For the last 20 minutes it has been writing novellas about what it is going to do, full of "Uhm", "Actually wait", "but no...", and I really just asked it to write an owner-draw CButton with different colors.

Now don't get me wrong, at the end it seems to get there - but I'm just having my own "Actually wait" thinking moment:

Is this the way they made it so smart?

Meanwhile, other models like Claude (the $20 plan is now just a total test-mode ripoff - the tokens get spent in 15 minutes, then you wait for hours) or ChatGPT (I currently prefer codex over CC, honestly it feels as smart) simply give you the answer almost right away for such simple things.

Edit: 30 minutes and > 100k tokens in, and now it starts writing CThemedButtonCtrl

Edit 2: the code had errors (not horrible, basic mistakes, like accessing protected members directly, but still, errors)

Edit 3: It also means that while you can get "x" times more tokens for the price they offer, you are actually going to use "x" times more tokens easily this way. Right now I'm at 150k for simple stuff with GLM 5.1. Now I'm not trying to upsell cc or codex, I don't care, but we need some perspective. 150k tokens in 30 min vs 15k-20k tokens in 2 min is a real difference and might not be "price smart". Of course ultimately we "can" run GLM 5.1 at home (well, I can't) but we can't run GPT or Claude... so yeah, but...

Edit 4: the code is ok-ish, but requires more of my input to fix stuff. Thinking of teeth and gifted horse right now...

Edit5: LOL: "Actually, I just realized I'm overcomplicating this..."

u/MorroHsu

1 mo. ago

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead.


English is not my first language. I wrote this in Chinese and translated it with AI help. The writing may have some AI flavor, but the design decisions, the production failures, and the thinking that distilled them into principles — those are mine.

I was a backend lead at Manus before the Meta acquisition. I've spent the last 2 years building AI agents, first at Manus, then on my own open-source agent runtime and agent. Along the way I came to a conclusion that surprised me:

A single run(command="...") tool with Unix-style commands outperforms a catalog of typed function calls.

Here's what I learned.

Why *nix

Unix made a design decision 50 years ago: everything is a text stream. Programs don't exchange complex binary structures or share memory objects — they communicate through text pipes. Small tools each do one thing well, composed via | into powerful workflows. Programs describe themselves with --help, report success or failure with exit codes, and communicate errors through stderr.

LLMs made an almost identical decision 50 years later: everything is tokens. They only understand text, only produce text. Their "thinking" is text, their "actions" are text, and the feedback they receive from the world must be text.

These two decisions, made half a century apart from completely different starting points, converge on the same interface model. The text-based system Unix designed for human terminal operators — cat, grep, pipe, exit codes, man pages — isn't just "usable" by LLMs. It's a natural fit. When it comes to tool use, an LLM is essentially a terminal operator — one that's faster than any human and has already seen vast amounts of shell commands and CLI patterns in its training data.

This is the core philosophy of the *nix Agent: don't invent a new tool interface. Take what Unix has proven over 50 years and hand it directly to the LLM.

Why a single run

The single-tool hypothesis

Most agent frameworks give LLMs a catalog of independent tools:

tools: [search_web, read_file, write_file, run_code, send_email, ...]

Before each call, the LLM must make a tool selection — which one? What parameters? The more tools you add, the harder the selection, and accuracy drops. Cognitive load is spent on "which tool?" instead of "what do I need to accomplish?"

My approach: one run(command="...") tool, all capabilities exposed as CLI commands.

run(command="cat notes.md")
run(command="cat log.txt | grep ERROR | wc -l")
run(command="see screenshot.png")
run(command="memory search 'deployment issue'")
run(command="clip sandbox bash 'python3 analyze.py'")

The LLM still chooses which command to use, but this is fundamentally different from choosing among 15 tools with different schemas. Command selection is string composition within a unified namespace — function selection is context-switching between unrelated APIs.
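To make the shape of this concrete, here is a minimal sketch of a single `run` tool backed by a command registry. It is not the author's runtime: `register`, `COMMANDS`, and the toy `echo` and `upper` commands are hypothetical names invented for this example, and pipes are left out.

```python
import shlex
from typing import Callable

# Hypothetical registry: command name -> handler(args) -> (exit_code, output)
Handler = Callable[[list[str]], tuple[int, str]]
COMMANDS: dict[str, Handler] = {}


def register(name: str):
    """Decorator that adds a handler to the registry under a CLI-style name."""
    def wrap(fn: Handler) -> Handler:
        COMMANDS[name] = fn
        return fn
    return wrap


@register("echo")
def echo(args: list[str]) -> tuple[int, str]:
    return 0, " ".join(args)


@register("upper")
def upper(args: list[str]) -> tuple[int, str]:
    if not args:
        return 2, "upper: usage: upper <text>..."
    return 0, " ".join(args).upper()


def run(command: str) -> str:
    """The only tool the model sees: parse one command string and dispatch it."""
    argv = shlex.split(command)
    if not argv:
        return "[error] empty command"
    handler = COMMANDS.get(argv[0])
    if handler is None:
        return f"[error] {argv[0]}: command not found"
    code, output = handler(argv[1:])
    return output if code == 0 else f"[error] {output}"


if __name__ == "__main__":
    print(run("echo hello world"))
    print(run("upper"))
```

Adding a capability means registering another handler; the tool schema the model sees never changes.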

LLMs already speak CLI

Why are CLI commands a better fit for LLMs than structured function calls?

Because CLI is the densest tool-use pattern in LLM training data. Billions of lines on GitHub are full of:

# README install instructions
pip install -r requirements.txt && python main.py

# CI/CD build scripts
make build && make test && make deploy

# Stack Overflow solutions
cat /var/log/syslog | grep "Out of memory" | tail -20

I don't need to teach the LLM how to use CLI — it already knows. This familiarity is probabilistic and model-dependent, but in practice it's remarkably reliable across mainstream models.

Compare two approaches to the same task:

Task: Read a log file, count the error lines

Function-calling approach (3 tool calls):
  1. read_file(path="/var/log/app.log") → returns entire file
  2. search_text(text=<entire file>, pattern="ERROR") → returns matching lines
  3. count_lines(text=<matched lines>) → returns number

CLI approach (1 tool call):
  run(command="cat /var/log/app.log | grep ERROR | wc -l")
  → "42"

One call replaces three. Not because of special optimization — but because Unix pipes natively support composition.

Making pipes and chains work

A single run isn't enough on its own. If run can only execute one command at a time, the LLM still needs multiple calls for composed tasks. So I built a chain parser (parseChain) in the command routing layer, supporting four Unix operators:

|   Pipe: stdout of previous command becomes stdin of next
&&  And:  execute next only if previous succeeded
||  Or:   execute next only if previous failed
;   Seq:  execute next regardless of previous result

With this mechanism, every tool call can be a complete workflow:

# One tool call: download → inspect
curl -sL $URL -o data.csv && cat data.csv | head 5

# One tool call: read → filter → sort → top 10
cat access.log | grep "500" | sort | head 10

# One tool call: try A, fall back to B
cat config.yaml || echo "config not found, using defaults"

N commands × 4 operators — the composition space grows dramatically. And to the LLM, it's just a string it already knows how to write.
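A chain parser along these lines can be sketched in a few dozen lines. This is an illustration of the four operators, not the author's parseChain: the names `parse_chain`, `run_chain`, and `toy_execute` are hypothetical, backslash escapes are not handled, and the toy commands stand in for a real command router.

```python
import shlex

OPERATORS = ("&&", "||", "|", ";")  # two-character operators are matched first


def parse_chain(line: str) -> list[tuple[str, str]]:
    """Split a line into (operator, command) pairs; the first pair uses ';'.

    Operators inside single or double quotes are left alone.
    """
    pairs, current, pending, quote, i = [], "", ";", None, 0
    while i < len(line):
        ch = line[i]
        if quote:
            quote = None if ch == quote else quote
        elif ch in "'\"":
            quote = ch
        else:
            op = next((o for o in OPERATORS if line.startswith(o, i)), None)
            if op:
                pairs.append((pending, current.strip()))
                current, pending = "", op
                i += len(op)
                continue
        current += ch
        i += 1
    pairs.append((pending, current.strip()))
    return pairs


def run_chain(line: str, execute) -> tuple[int, str]:
    """Run a chain; `execute(argv, stdin)` must return (exit_code, stdout)."""
    code, out, skipped, chunks = 0, "", False, []
    for op, cmd in parse_chain(line):
        if not cmd:
            continue
        if op == "&&":
            skipped = code != 0
        elif op == "||":
            skipped = code == 0
        elif op == ";":
            skipped = False
        # "|" inherits `skipped` from the command feeding the pipe
        if skipped:
            continue
        if op == "|":
            stdin = out  # previous stdout is consumed by the pipe
        else:
            if out:
                chunks.append(out)  # previous stdout goes to the final output
            stdin = ""
        code, out = execute(shlex.split(cmd), stdin)
    if out:
        chunks.append(out)
    return code, "\n".join(chunks)


def toy_execute(argv: list[str], stdin: str) -> tuple[int, str]:
    """Hypothetical stand-in for the real command router."""
    name, args = argv[0], argv[1:]
    if name == "echo":
        return 0, " ".join(args)
    if name == "grep":
        hits = [ln for ln in stdin.splitlines() if args[0] in ln]
        return (0 if hits else 1), "\n".join(hits)
    if name == "false":
        return 1, ""
    return 127, f"{name}: command not found"
```

For example, `run_chain("false || echo fallback", toy_execute)` takes the fallback branch, while `run_chain("false && echo no ; echo yes", toy_execute)` skips the middle command and still runs the last one.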

The command line is the LLM's native tool interface.

Heuristic design: making CLI guide the agent

Single-tool + CLI solves "what to use." But the agent still needs to know "how to use it." It can't Google. It can't ask a colleague. I use three progressive design techniques to make the CLI itself serve as the agent's navigation system.

Technique 1: Progressive --help discovery

A well-designed CLI tool doesn't require reading documentation — because --help tells you everything. I apply the same principle to the agent, structured as progressive disclosure: the agent doesn't need to load all documentation at once, but discovers details on-demand as it goes deeper.

Level 0: Tool Description → command list injection

The run tool's description is dynamically generated at the start of each conversation, listing all registered commands with one-line summaries:

Available commands:
  cat    — Read a text file. For images use 'see'. For binary use 'cat -b'.
  see    — View an image (auto-attaches to vision)
  ls     — List files in current topic
  write  — Write file. Usage: write <path> [content] or stdin
  grep   — Filter lines matching a pattern (supports -i, -v, -c)
  memory — Search or manage memory
  clip   — Operate external environments (sandboxes, services)
  ...

The agent knows what's available from turn one, but doesn't need every parameter of every command — that would waste context.
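As a rough illustration of Level 0, the description can be rendered from whatever is registered when the conversation starts. The Command type and buildToolDescription are hypothetical names, and the sketch uses a plain hyphen as the separator.

```go
package main

import (
	"fmt"
	"strings"
)

// Command is a hypothetical registry entry: a name plus a one-line summary.
type Command struct {
	Name    string
	Summary string
}

// buildToolDescription renders the command list that gets injected into the
// run tool's description, keeping registration order and aligning summaries.
func buildToolDescription(cmds []Command) string {
	width := 0
	for _, c := range cmds {
		if len(c.Name) > width {
			width = len(c.Name)
		}
	}
	var b strings.Builder
	b.WriteString("Available commands:\n")
	for _, c := range cmds {
		fmt.Fprintf(&b, "  %-*s - %s\n", width, c.Name, c.Summary)
	}
	return b.String()
}

func main() {
	fmt.Print(buildToolDescription([]Command{
		{"cat", "Read a text file. For images use 'see'."},
		{"see", "View an image (auto-attaches to vision)"},
		{"memory", "Search or manage memory"},
	}))
}
```

Because the text is generated, a newly registered command shows up in the next conversation without anyone editing a prompt.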

Note: There's an open design question here: injecting the full command list vs. on-demand discovery. As commands grow, the list itself consumes context budget. I'm still exploring the right balance. Ideas welcome.

Level 1: command (no args) → usage

When the agent is interested in a command, it just calls it. No arguments? The command returns its own usage:

→ run(command="memory")
[error] memory: usage: memory search|recent|store|facts|forget

→ run(command="clip")
  clip list                              — list available clips
  clip <name>                            — show clip details and commands
  clip <name> <command> [args...]         — invoke a command
  clip <name> pull <remote-path> [name]   — pull file from clip to local
  clip <name> push <local-path> <remote>  — push local file to clip

Now the agent knows memory has five subcommands and clip supports list/pull/push. One call, no noise.

Level 2: command subcommand (missing args) → specific parameters

The agent decides to use memory search but isn't sure about the format? It drills down:

→ run(command="memory search")
[error] memory: usage: memory search <query> [-t topic_id] [-k keyword]

→ run(command="clip sandbox")
  Clip: sandbox
  Commands:
    clip sandbox bash <script>
    clip sandbox read <path>
    clip sandbox write <path>
  File transfer:
    clip sandbox pull <remote-path> [local-name]
    clip sandbox push <local-path> <remote-path>

Progressive disclosure: overview (injected) → usage (explored) → parameters (drilled down). The agent discovers on-demand, each level providing just enough information for the next step.

This is fundamentally different from stuffing 3,000 words of tool documentation into the system prompt. Most of that information is irrelevant most of the time — pure context waste. Progressive help lets the agent decide when it needs more.

This also imposes a requirement on command design: every command and subcommand must have complete help output. It's not just for humans — it's for the agent. A good help message means one-shot success. A missing one means a blind guess.
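One way to get Level 1 and Level 2 for free is to describe commands as a small tree and let the dispatcher answer with usage whenever the input is incomplete. This is a sketch under assumptions: Cmd, memoryCmd and resolve are made-up names, and the argument lists for store and forget are invented (only the search usage appears above).

```go
package main

import (
	"fmt"
	"strings"
)

// Cmd is a hypothetical command-tree node. Leaves describe their arguments,
// inner nodes list subcommands in display order.
type Cmd struct {
	Name    string
	Args    string // argument synopsis for leaf commands
	MinArgs int    // required positional arguments for leaf commands
	Subs    []*Cmd
}

var memoryCmd = &Cmd{Name: "memory", Subs: []*Cmd{
	{Name: "search", Args: "<query> [-t topic_id] [-k keyword]", MinArgs: 1},
	{Name: "recent"},
	{Name: "store", Args: "<text>", MinArgs: 1}, // invented for the sketch
	{Name: "facts"},
	{Name: "forget", Args: "<id>", MinArgs: 1}, // invented for the sketch
}}

// resolve walks the tree along argv. It returns a usage message as soon as
// the input is incomplete, and ok=true once argv is a complete invocation.
func resolve(root *Cmd, argv []string) (msg string, ok bool) {
	cur, path := root, root.Name
	for len(cur.Subs) > 0 {
		names := make([]string, len(cur.Subs))
		var next *Cmd
		for i, s := range cur.Subs {
			names[i] = s.Name
			if len(argv) > 0 && s.Name == argv[0] {
				next = s
			}
		}
		if next == nil {
			// Missing or unknown subcommand: answer with the overview.
			return fmt.Sprintf("[error] %s: usage: %s %s",
				root.Name, path, strings.Join(names, "|")), false
		}
		cur, path, argv = next, path+" "+next.Name, argv[1:]
	}
	if len(argv) < cur.MinArgs {
		// Known subcommand, missing arguments: answer with its parameters.
		return fmt.Sprintf("[error] %s: usage: %s %s", root.Name, path, cur.Args), false
	}
	return "", true
}

func main() {
	for _, argv := range [][]string{nil, {"search"}, {"search", "gpu"}} {
		msg, ok := resolve(memoryCmd, argv)
		fmt.Printf("%v ok=%v %s\n", argv, ok, msg)
	}
}
```

The help text lives next to the command definition, so it cannot silently drift out of date the way a separate doc page can.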

Technique 2: Error messages as navigation

Agents will make mistakes. The key isn't preventing errors; it's making every error point in the right direction.

Traditional CLI errors are designed for humans who can Google. Agents can't Google. So I require every error to contain both "what went wrong" and "what to do instead":

Traditional CLI:
  $ cat photo.png
  cat: binary file (standard output)
  → Human Googles "how to view image in terminal"

My design:
  [error] cat: binary image file (182KB). Use: see photo.png
  → Agent calls see directly, one-step correction

More examples:

[error] unknown command: foo
Available: cat, ls, see, write, grep, memory, clip, ...
→ Agent immediately knows what commands exist

[error] not an image file: data.csv (use cat to read text files)
→ Agent switches from see to cat

[error] clip "sandbox" not found. Use 'clip list' to see available clips
→ Agent knows to list clips first

Technique 1 (help) solves "what can I do?" Technique 2 (errors) solves "what should I do instead?" Together, the agent's recovery cost is minimal — usually 1-2 steps to the right path.
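A small sketch of the "what went wrong + what to do instead" rule as a type, so a command cannot return an error without a next step. NavError, lookup and see are hypothetical names, the exact wording is illustrative, and the extension check stands in for real image detection.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// NavError pairs what went wrong with what to do instead.
type NavError struct {
	What string
	Next string
}

func (e *NavError) Error() string {
	return fmt.Sprintf("[error] %s. %s", e.What, e.Next)
}

// lookup resolves a command name or tells the agent which names exist.
func lookup(name string, registered []string) error {
	for _, r := range registered {
		if r == name {
			return nil
		}
	}
	return &NavError{
		What: "unknown command: " + name,
		Next: "Available: " + strings.Join(registered, ", "),
	}
}

// see is a stand-in for the image viewer: it rejects non-image files and
// points the agent at cat instead.
func see(path string) (string, error) {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".png", ".jpg", ".jpeg", ".gif", ".webp":
		return "[image attached: " + path + "]", nil
	}
	return "", &NavError{What: "not an image file: " + path, Next: "Use: cat " + path}
}

func main() {
	fmt.Println(lookup("foo", []string{"cat", "ls", "see"}))
	_, err := see("data.csv")
	fmt.Println(err)
}
```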

Real case: The cost of silent stderr

For a while, my code silently dropped stderr when calling external sandboxes — whenever stdout was non-empty, stderr was discarded. The agent ran pip install pymupdf, got exit code 127. stderr contained bash: pip: command not found, but the agent couldn't see it. It only knew "it failed," not "why" — and proceeded to blindly guess 10 different package managers:

pip install         → 127  (doesn't exist)
python3 -m pip      → 1    (module not found)
uv pip install      → 1    (wrong usage)
pip3 install        → 127
sudo apt install    → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓  (10th try)

10 calls, ~5 seconds of inference each. If stderr had been visible the first time, one call would have been enough.

stderr is the information agents need most, precisely when commands fail. Never drop it.

Technique 3: Consistent output format

The first two techniques handle discovery and correction. The third lets the agent get better at using the system over time.

I append consistent metadata to every tool result:

file1.txt
file2.txt
dir1/
[exit:0 | 12ms]

The LLM extracts two signals:

Exit codes (Unix convention, LLMs already know these):

exit:0 — success
exit:1 — general error
exit:127 — command not found

Duration (cost awareness):

12ms — cheap, call freely
3.2s — moderate
45s — expensive, use sparingly

After seeing [exit:N | Xs] dozens of times in a conversation, the agent internalizes the pattern. It starts anticipating — seeing exit:1 means check the error, seeing long duration means reduce calls.

Consistent output format makes the agent smarter over time. Inconsistency makes every call feel like the first.

The three techniques form a progression:

--help       →  "What can I do?"        →  Proactive discovery
Error Msg    →  "What should I do?"     →  Reactive correction
Output Fmt   →  "How did it go?"        →  Continuous learning

Two-layer architecture: engineering the heuristic design

The section above described how CLI guides agents at the semantic level. But to make it work in practice, there's an engineering problem: the raw output of a command and what the LLM needs to see are often very different things.

Two hard constraints of LLMs

Constraint A: The context window is finite and expensive. Every token costs money, attention, and inference speed. Stuffing a 10MB file into context doesn't just waste budget — it pushes earlier conversation out of the window. The agent "forgets."

Constraint B: LLMs can only process text. Binary data produces high-entropy meaningless tokens through the tokenizer. It doesn't just waste context — it disrupts attention on surrounding valid tokens, degrading reasoning quality.

These two constraints mean: raw command output can't go directly to the LLM — it needs a presentation layer for processing. But that processing can't affect command execution logic — or pipes break. Hence, two layers.

Execution layer vs. presentation layer

┌─────────────────────────────────────────────┐
│  Layer 2: LLM Presentation Layer            │  ← Designed for LLM constraints
│  Binary guard | Truncation+overflow | Meta   │
├─────────────────────────────────────────────┤
│  Layer 1: Unix Execution Layer              │  ← Pure Unix semantics
│  Command routing | pipe | chain | exit code │
└─────────────────────────────────────────────┘

When cat bigfile.txt | grep error | head 10 executes:

Inside Layer 1:
  cat output → [500KB raw text] → grep input
  grep output → [matching lines] → head input
  head output → [first 10 lines]

If you truncate cat's output in Layer 1 → grep only searches the first 200 lines, producing incomplete results. If you add [exit:0] in Layer 1 → it flows into grep as data, becoming a search target.

So Layer 1 must remain raw, lossless, metadata-free. Processing only happens in Layer 2 — after the pipe chain completes and the final result is ready to return to the LLM.

Layer 1 serves Unix semantics. Layer 2 serves LLM cognition. The separation isn't a design preference — it's a logical necessity.
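The separation can be shown with a toy pipeline. Stages are plain functions over strings here, which is an assumption for the sketch (Stage, runPipe and presentToLLM are made-up names, not the project's API). The point is only that the footer is added once, after the last stage.

```go
package main

import (
	"fmt"
	"strings"
)

// Stage is a Layer 1 command: raw stdin in, raw stdout out, plus an exit code.
type Stage func(stdin string) (stdout string, exit int)

// runPipe is Layer 1: every stage sees the previous stage's complete,
// unmodified output. The exit code is that of the last stage.
func runPipe(stages ...Stage) (string, int) {
	data, exit := "", 0
	for _, s := range stages {
		data, exit = s(data)
	}
	return data, exit
}

// presentToLLM is Layer 2: it runs once, on the final result only.
func presentToLLM(out string, exit int) string {
	return out + fmt.Sprintf("[exit:%d]", exit)
}

func source(text string) Stage {
	return func(string) (string, int) { return text, 0 }
}

func grep(pattern string) Stage {
	return func(in string) (string, int) {
		var hits []string
		for _, l := range strings.SplitAfter(in, "\n") {
			if strings.Contains(l, pattern) {
				hits = append(hits, l)
			}
		}
		if len(hits) == 0 {
			return "", 1
		}
		return strings.Join(hits, ""), 0
	}
}

func head(n int) Stage {
	return func(in string) (string, int) {
		parts := strings.SplitAfterN(in, "\n", n+1)
		if len(parts) > n {
			parts = parts[:n]
		}
		return strings.Join(parts, ""), 0
	}
}

func main() {
	var log strings.Builder
	for i := 1; i <= 500; i++ {
		if i == 400 {
			log.WriteString("error: disk full\n")
		} else {
			log.WriteString("ok\n")
		}
	}
	out, exit := runPipe(source(log.String()), grep("error"), head(10))
	fmt.Println(presentToLLM(out, exit))
}
```

Putting head(200) before grep in this pipeline mimics truncating inside Layer 1: the match on line 400 is never seen.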

Layer 2's four mechanisms

Mechanism A: Binary Guard (addressing Constraint B)

Before returning anything to the LLM, check if it's text:

Null byte detected → binary
UTF-8 validation failed → binary
Control character ratio > 10% → binary

If image: [error] binary image (182KB). Use: see photo.png
If other: [error] binary file (1.2MB). Use: cat -b file.bin

The LLM never receives data it can't process.
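The three checks translate almost directly into code. This is a sketch rather than the project's isBinary: measuring the control-character ratio per byte and exempting tab, newline and carriage return are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// isBinary applies the three checks in order: null byte, UTF-8 validity,
// then the share of control characters.
func isBinary(data []byte) bool {
	if len(data) == 0 {
		return false
	}
	if bytes.IndexByte(data, 0) >= 0 {
		return true
	}
	if !utf8.Valid(data) {
		return true
	}
	control := 0
	for _, b := range data {
		// Count C0 control bytes and DEL, but not tab, newline or CR.
		if (b < 0x20 && b != '\t' && b != '\n' && b != '\r') || b == 0x7f {
			control++
		}
	}
	return control*10 > len(data) // ratio above 10%
}

func main() {
	pngHeader := []byte{0x89, 'P', 'N', 'G', 0x0d, 0x0a, 0x1a, 0x0a}
	fmt.Println(isBinary([]byte("hello\nworld\n")), isBinary(pngHeader))
}
```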

Mechanism B: Overflow Mode (addressing Constraint A)

Output > 200 lines or > 50KB?
  → Truncate to first 200 lines (rune-safe, won't split UTF-8)
  → Write full output to /tmp/cmd-output/cmd-{n}.txt
  → Return to LLM:

    [first 200 lines]

    --- output truncated (5000 lines, 245.3KB) ---
    Full output: /tmp/cmd-output/cmd-3.txt
    Explore: cat /tmp/cmd-output/cmd-3.txt | grep <pattern>
             cat /tmp/cmd-output/cmd-3.txt | tail 100
    [exit:0 | 1.2s]

Key insight: the LLM already knows how to use grep, head, tail to navigate files. Overflow mode transforms "large data exploration" into a skill the LLM already has.
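A sketch of overflow mode with the limits quoted above. truncate and present are hypothetical names, the save path is passed in by the caller, and the notice text simply mirrors the example output.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"unicode/utf8"
)

const (
	maxLines = 200
	maxBytes = 50 * 1024
)

// truncate returns the part of out that fits within both limits and reports
// whether anything was cut. It never splits a UTF-8 sequence.
func truncate(out string) (string, bool) {
	head := out
	if parts := strings.SplitAfterN(out, "\n", maxLines+1); len(parts) > maxLines {
		head = strings.Join(parts[:maxLines], "")
	}
	if len(head) > maxBytes {
		cut := maxBytes
		// Back up until the first dropped byte starts a new rune.
		for cut > 0 && !utf8.RuneStart(head[cut]) {
			cut--
		}
		head = head[:cut]
	}
	return head, len(head) < len(out)
}

// present saves the full output when it overflows and returns the truncated
// view plus hints for exploring the saved file.
func present(out, savePath string) (string, error) {
	head, cut := truncate(out)
	if !cut {
		return out, nil
	}
	if err := os.MkdirAll(filepath.Dir(savePath), 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(savePath, []byte(out), 0o644); err != nil {
		return "", err
	}
	return fmt.Sprintf("%s\n--- output truncated (%d lines, %.1fKB) ---\n"+
		"Full output: %s\nExplore: cat %s | grep <pattern>\n         cat %s | tail 100",
		head, strings.Count(out, "\n"), float64(len(out))/1024,
		savePath, savePath, savePath), nil
}

func main() {
	big := strings.Repeat("2024-01-01 GET /index.html 200\n", 5000)
	msg, err := present(big, filepath.Join(os.TempDir(), "cmd-output", "cmd-1.txt"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(msg[len(msg)-200:]) // show just the tail with the notice
}
```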

Mechanism C: Metadata Footer

actual output here
[exit:0 | 1.2s]

Exit code + duration, appended as the last line of Layer 2. Gives the agent signals for success/failure and cost awareness, without polluting Layer 1's pipe data.

Mechanism D: stderr Attachment

When a command fails with stderr:
  output + "\n[stderr] " + stderr

Ensures the agent can see why something failed, preventing blind retries.
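Mechanisms C and D fit in one small formatting step. This is a sketch with assumed names (Result, present, formatDuration). Showing milliseconds below one second and one decimal of seconds above is a guess based on the examples.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Result is what Layer 1 hands over once the whole chain has finished.
type Result struct {
	Stdout   string
	Stderr   string
	ExitCode int
	Duration time.Duration
}

// formatDuration prints milliseconds below one second, seconds above.
func formatDuration(d time.Duration) string {
	if d < time.Second {
		return fmt.Sprintf("%dms", d.Milliseconds())
	}
	return fmt.Sprintf("%.1fs", d.Seconds())
}

// present builds the text the LLM sees: stdout, then stderr if the command
// failed, then the metadata footer as the last line.
func present(r Result) string {
	out := r.Stdout
	if r.ExitCode != 0 && r.Stderr != "" {
		if out != "" && !strings.HasSuffix(out, "\n") {
			out += "\n"
		}
		out += "[stderr] " + strings.TrimRight(r.Stderr, "\n")
	}
	if out != "" && !strings.HasSuffix(out, "\n") {
		out += "\n"
	}
	return out + fmt.Sprintf("[exit:%d | %s]", r.ExitCode, formatDuration(r.Duration))
}

func main() {
	fmt.Println(present(Result{
		Stdout:   "Collecting pymupdf",
		Stderr:   "bash: pip: command not found\n",
		ExitCode: 127,
		Duration: 1200 * time.Millisecond,
	}))
}
```

Note that stdout being non-empty has no effect on whether stderr is shown, which is exactly the condition that went wrong in the silent-stderr case.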

Lessons learned: stories from production

Story 1: A PNG that caused 20 iterations of thrashing

A user uploaded an architecture diagram. The agent read it with cat, receiving 182KB of raw PNG bytes. The LLM's tokenizer turned these bytes into thousands of meaningless tokens crammed into the context. The LLM couldn't make sense of it and started trying different read approaches — cat -f, cat --format, cat --type image — each time receiving the same garbage. After 20 iterations, the process was force-terminated.

Root cause: cat had no binary detection, Layer 2 had no guard. Fix: isBinary() guard + error guidance Use: see photo.png. Lesson: The tool result is the agent's eyes. Return garbage = agent goes blind.

Story 2: Silent stderr and 10 blind retries

The agent needed to read a PDF. It tried pip install pymupdf, got exit code 127. stderr contained bash: pip: command not found, but the code dropped it — because there was some stdout output, and the logic was "if stdout exists, ignore stderr."

The agent only knew "it failed," not "why." What followed was a long round of trial and error:

pip install         → 127  (doesn't exist)
python3 -m pip      → 1    (module not found)
uv pip install      → 1    (wrong usage)
pip3 install        → 127
sudo apt install    → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓

10 calls, ~5 seconds of inference each. If stderr had been visible the first time, one call would have sufficed.

Root cause: InvokeClip silently dropped stderr when stdout was non-empty. Fix: Always attach stderr on failure. Lesson: stderr is the information agents need most, precisely when commands fail.

Story 3: The value of overflow mode

The agent analyzed a 5,000-line log file. Without truncation, the full text (~200KB) was stuffed into context. The LLM's attention was overwhelmed, response quality dropped sharply, and earlier conversation was pushed out of the context window.

With overflow mode:

[first 200 lines of log content]

--- output truncated (5000 lines, 198.5KB) ---
Full output: /tmp/cmd-output/cmd-3.txt
Explore: cat /tmp/cmd-output/cmd-3.txt | grep <pattern>
         cat /tmp/cmd-output/cmd-3.txt | tail 100
[exit:0 | 45ms]

The agent saw the first 200 lines, understood the file structure, then used grep to pinpoint the issue — 3 calls total, under 2KB of context.

Lesson: Giving the agent a "map" is far more effective than giving it the entire territory.

Boundaries and limitations

CLI isn't a silver bullet. Typed APIs may be the better choice in these scenarios:

Strongly-typed interactions: Database queries, GraphQL APIs, and other cases requiring structured input/output. Schema validation is more reliable than string parsing.
High-security requirements: CLI's string concatenation carries inherent injection risks. In untrusted-input scenarios, typed parameters are safer. agent-clip mitigates this through sandbox isolation.
Native multimodal: Pure audio/video processing and other binary-stream scenarios where CLI's text pipe is a bottleneck.

Additionally, "no iteration limit" doesn't mean "no safety boundaries." Safety is ensured by external mechanisms:

Sandbox isolation: Commands execute inside BoxLite containers, no escape possible
API budgets: LLM calls have account-level spending caps
User cancellation: Frontend provides cancel buttons, backend supports graceful shutdown

Hand Unix philosophy to the execution layer, hand LLM's cognitive constraints to the presentation layer, and use help, error messages, and output format as three progressive heuristic navigation techniques.
CLI is all agents need.

Source code (Go):

Core files: internal/tools.go (command routing), internal/chain.go (pipes), internal/loop.go (two-layer agentic loop), internal/fs.go (binary guard), internal/clip.go (stderr handling), internal/browser.go (vision auto-attach), internal/memory.go (semantic memory).

Happy to discuss — especially if you've tried similar approaches or found cases where CLI breaks down. The command discovery problem (how much to inject vs. let the agent discover) is something I'm still actively exploring.

u/JayPatel24_

31 min. ago

Back again with another training problem I keep running into while building dataset slices for smaller LLMs

Hey, I’m back with another one from the pile of model behaviors I’ve been trying to isolate and turn into trainable dataset slices.

This time the problem is reliable JSON extraction from financial-style documents.

I keep seeing the same pattern:

You can prompt a smaller/open model hard enough that it looks good in a demo.
It gives you JSON.
It extracts the right fields.
You think you’re close.

That’s the part that keeps making me think this is not just a prompt problem.

It feels more like a training problem.

A lot of what I’m building right now is around this idea that model quality should be broken into very narrow behaviors and trained directly, instead of hoping a big prompt can hold everything together.

For this one, the behavior is basically:

Can the model stay schema-first, even when the input gets messy?

Not just:
“can it produce JSON once?”

But:

can it keep the same structure every time
can it make success and failure outputs equally predictable

One of the row patterns I’ve been looking at has this kind of training signal built into it:

{
  "sample_id": "lane_16_code_json_spec_mode_en_00000001",
  "assistant_response": "Design notes: - Storage: a local JSON file with explicit load and save steps. - Bad: vague return values. Good: consistent shapes for success and failure."
}

What I like about this kind of row is that it does not just show the model a format.

It teaches the rule:

vague output is bad
stable structured output is good

That feels especially relevant for stuff like:

financial statement extraction
invoice parsing

So this is one of the slices I’m working on right now while building out behavior-specific training data.

Curious how other people here think about this.

u/danielhanchen

6 days ago

You can now fine-tune Gemma 4 locally 8GB VRAM + Bug Fixes

u/kylerrr02

7 hr. ago

Llama4 108b $800 setup

u/PrincipleFar6835

5 hr. ago

Better alternative to CLI and MCP for local tools: Seeking feedback on my open-source project

u/HealthyCommunicat

22 hr. ago

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q

mtmd: add Gemma 4 audio conformer encoder support

18 hr. ago

https://github.com/ggml-org/llama.cpp/pull/21421

u/Zyj

1 day ago

Unsloth MiniMax M2.7 quants just finished uploading to HF

They range from Q1 to BF16.

Grab them while they're still hot over at

Thanks to !

Here's the current list:

Bits	Quantization Label	Size
1-bit	UD-IQ1_M	60.7 GB
2-bit	UD-IQ2_XXS	65.4 GB
	UD-IQ2_M	70.1 GB
	UD-Q2_K_XL	75.3 GB
3-bit	UD-IQ3_XXS	80.1 GB
	UD-IQ3_S	83.6 GB
	UD-Q3_K_S	93.6 GB
	UD-Q3_K_M	101 GB
	UD-Q3_K_XL	102 GB
4-bit	UD-IQ4_XS	108 GB
	UD-IQ4_NL	111 GB
	UD-Q4_K_S	131 GB
	MXFP4_MOE	136 GB
	UD-Q4_K_M	140 GB
	UD-Q4_K_XL	141 GB
5-bit	UD-Q5_K_S	159 GB
	UD-Q5_K_M	169 GB
	UD-Q5_K_XL	169 GB
6-bit	UD-Q6_K	188 GB
	UD-Q6_K_XL	207 GB
8-bit	Q8_0	243 GB
	UD-Q8_K_XL	247 GB
16-bit	BF16	457 GB

u/m3m3o

40 min. ago

Llama 4 Maverick tool calling is significantly less reliable than Llama 3.1 70B in multi-agent systems

https://mehmetgoekce.substack.com/p/i-swapped-llama-31-70b-for-llama

u/220nyx

41 min. ago

Obsidian Second Brain Model??

I got a MacBook Pro M4 Pro 24GB Unified RAM

I was wondering if anybody here uses local LLM models as their second brain director for Obsidian.

- Summarise notes

- Link notes

- Tag notes

- Going deeper into the notes

- etc

But my main goal with this is to use a local model to refer to my vault as a RAG pipeline.

I’ve only recently begun testing which specific model would be good for this on my specs. Any suggestions?

u/-p-e-w-

16 days ago

A simple explanation of the key idea behind TurboQuant

TurboQuant has been all the rage in the past two days, and I've seen lots of comments here attempting to explain the magic behind it. Many of those comments boil down to "dude, it's polar coordinates!!!", and that's really misleading. The most important part has nothing to do with polar coordinates (although they are emphasized in Google's blog post, so the confusion is understandable).

TurboQuant is a vector quantization algorithm. It turns a vector of numbers into another vector of numbers that takes up less memory.

Quantization is a fairly basic operation. If you have an n-dimensional vector that looks like this:

Then a quantized version of that vector may look like this:

0.237
0.723
0.543
0.100
...

Notice how I simply shaved off the last four digits of each number? That's already an example of a crude quantization process. Obviously, there are far more sophisticated schemes, including grouping coefficients in blocks, adaptive thresholds, calibrated precision based on experimental data etc., but at its core, quantization always involves reducing coefficient precision.

Here is the key idea behind TurboQuant: Before quantizing a vector, we randomly rotate it in the n-dimensional space it resides in. The corresponding counter-rotation is applied during dequantization.

That's it.

Now you probably feel that I must have left out an important detail. Surely the rotation can't be completely random? Maybe it's sampled from a particular distribution, or somehow input-dependent? Or perhaps there is another operation that goes hand in hand with it?

Nope. I didn't leave anything out. Just applying a random rotation to the vector dramatically improves quantization performance.

But why?

Because the magnitudes of the coefficients of state vectors in language models aren't distributed uniformly among the vector dimensions. It's very common to see vectors that look like this:

0.0000023
0.9999428  <-- !!!
0.0000738
0.0000003
...

This phenomenon has many names, and it shows up everywhere in transformer research. You can read about "massive activations" and "attention sinks" for a deeper analysis.

What matters for the purposes of this explanation is: Vectors with this type of quasi-sparse structure are terrible targets for component quantization. Reducing precision in such a vector effectively turns the massive component into 1 (assuming the vector is normalized), and all other components into 0. That is, quantization "snaps" the vector to its nearest cardinal direction. This collapses the information content of the vector, as identifying a cardinal direction takes only log2(2n) bits, whereas the quantized vector can hold kn bits (assuming k bits per component).

And that's where the random rotation comes in! Since most directions aren't near a cardinal direction (and this only becomes more true as the number of dimensions increases), a random rotation almost surely results in a vector that distributes the coefficient weight evenly across all components, meaning that quantization doesn't cause information loss beyond that expected from precision reduction.

The TurboQuant paper proves this mathematically, and gives an exact description of the distribution behavior, but the intuitive understanding is much more straightforward than that.
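Here is a hedged sketch of the rotate-then-quantize idea in Go. It is not TurboQuant itself: the rotation is a random orthogonal matrix built with Gram-Schmidt, the quantizer is plain absmax rounding, and it skips the paper's second, bias-correcting step entirely. All function names are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// randomOrthogonal builds an n×n orthogonal matrix by running Gram-Schmidt
// over rows of Gaussian noise. (It may include a reflection; that does not
// matter for this demonstration.)
func randomOrthogonal(n int, seed int64) [][]float64 {
	rng := rand.New(rand.NewSource(seed))
	q := make([][]float64, 0, n)
	for len(q) < n {
		row := make([]float64, n)
		for j := range row {
			row[j] = rng.NormFloat64()
		}
		for _, prev := range q { // remove what earlier rows already cover
			d := dot(row, prev)
			for j := range row {
				row[j] -= d * prev[j]
			}
		}
		norm := math.Sqrt(dot(row, row))
		for j := range row {
			row[j] /= norm
		}
		q = append(q, row)
	}
	return q
}

// rotate computes Q·x.
func rotate(q [][]float64, x []float64) []float64 {
	y := make([]float64, len(x))
	for i := range q {
		y[i] = dot(q[i], x)
	}
	return y
}

// unrotate computes Qᵀ·y, which undoes rotate because Q is orthogonal.
func unrotate(q [][]float64, y []float64) []float64 {
	x := make([]float64, len(y))
	for i := range q {
		for j := range x {
			x[j] += q[i][j] * y[i]
		}
	}
	return x
}

// quantize snaps every component to a signed grid of the given bit width
// (2 or more), scaled to the largest magnitude in the vector.
func quantize(x []float64, bits int) []float64 {
	scale := 0.0
	for _, v := range x {
		scale = math.Max(scale, math.Abs(v))
	}
	out := make([]float64, len(x))
	if scale == 0 {
		return out
	}
	levels := float64(int(1)<<(bits-1) - 1)
	for i, v := range x {
		out[i] = math.Round(v/scale*levels) / levels * scale
	}
	return out
}

// nonZero counts the components that survive quantization.
func nonZero(x []float64) int {
	n := 0
	for _, v := range x {
		if v != 0 {
			n++
		}
	}
	return n
}

func main() {
	x := make([]float64, 64)
	for i := range x {
		x[i] = 1e-4
	}
	x[1] = 1 // one massive component, as in the example above

	q := randomOrthogonal(len(x), 1)
	fmt.Println("surviving components, direct 4-bit: ", nonZero(quantize(x, 4)))
	fmt.Println("surviving components, rotated 4-bit:", nonZero(quantize(rotate(q, x), 4)))
}
```

Quantizing directly keeps only the massive component. After rotation the weight is spread out, so many components land on non-zero grid points, and unrotate maps the dequantized vector back to the original basis.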

This idea isn't new (RaBitQ employs the same trick, and QuIP a similar one), but TurboQuant combines it with a second step that eliminates biases that arise when quantized vectors that are optimal in a certain sense (MSE) are used to compute inner products, which is what happens in attention blocks. See the paper if you're interested in the details.

u/Thrumpwart

7 hr. ago

Aryagm/dflash-mlx: Exact speculative decoding on Apple Silicon, powered by MLX.

https://github.com/Aryagm/dflash-mlx

u/alfons_fhl

1 hr. ago

ClaudeCode CLI experience but with local LLMs — what are you guys using?

Been using ClaudeCode CLI with Opus 4.6 and many MCPs, and honestly it's addicting.

Just tell it what to build and it does everything — reads the codebase, writes code, runs commands, fixes its own errors. Pure vibe coding.

Now I want the same thing but with Qwen3-Coder-next running locally.

Not copilot autocomplete stuff, I mean the full "build me this feature" autonomous agent experience.

Looked into Cline, Aider, Open Interpreter so far. Cline seems closest but curious what you all are actually using day to day.

Anyone running a solid agentic setup with local models? What's working, what's not? And which one is the best?

u/FrozenFishEnjoyer

4 days ago

It's insane how lobotomized Opus 4.6 is right now. Even Gemma 4 31B UD IQ3 XXS beat it on the carwash test on my 5070 TI.

u/annodomini

10 hr. ago

Pi & Qwen3.5 with llama-cpp doing a lot of prompt re-processing

I've noticed an issue when I'm using Pi as a coding agent with llama-cpp, and I'm wondering if there's an issue with Pi or how I have it configured, or if this is just expected behavior.

I'm using Qwen3.5 122b with thinking enabled. When doing a bunch of agentic edits, it will do a lot of interleaving thinking and tool calls. This all works fine.

But then when it comes to my next turn providing input, I get a whole bunch of the context cache invalidated, because it looks like Pi is no longer sending over the thinking blocks. I see this in the llama-cpp log, where you can see that it diverged by dropping the thinking block:

srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.736 (> 0.100 thold), f_keep = 0.703
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  3 | task 29044 | processing task, is_child = 0
slot update_slots: id  3 | task 29044 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 48112
slot update_slots: id  3 | task 29044 | old: ... 
<|im_start|>assistant
 | <think>
The user is saying
slot update_slots: id  3 | task 29044 | new: ... 
<|im_start|>assistant
 | You're right - ball-to
slot update_slots: id  3 | task 29044 |      198  248045   74455     198  248068     198     760    1156     369    5315
slot update_slots: id  3 | task 29044 |      198  248045   74455     198    2523    2224    1245     471    4776    4534
slot update_slots: id  3 | task 29044 | n_past = 35407, slot.prompt.tokens.size() = 50377, seq_id = 3, pos_min = 50376, n_swa = 0

And then it goes on to invalidate a bunch of the context checkpoints and recomputes the cache from the point where the history diverged, i.e. where the thinking context was dropped.
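To see why one dropped thinking block is so expensive, here is a simplified model of prefix reuse, using the ten token IDs from the two update_slots lines in the log. It is only a sketch of the idea: commonPrefixLen is a made-up helper, and the server's real slot and checkpoint logic is more involved.

```go
package main

import "fmt"

// commonPrefixLen returns how many leading tokens two prompts share.
// Only that shared prefix can be served from a prefix cache.
func commonPrefixLen(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	// Token IDs copied from the "old" and "new" lines of the log above.
	cached := []int{198, 248045, 74455, 198, 248068, 198, 760, 1156, 369, 5315}
	incoming := []int{198, 248045, 74455, 198, 2523, 2224, 1245, 471, 4776, 4534}

	keep := commonPrefixLen(cached, incoming)
	fmt.Printf("tokens in this window that still match: %d of %d\n", keep, len(cached))
}
```

In this window the sequences agree for four tokens and then split where the cached history has the thinking block and the new prompt does not. Everything after the split has to be recomputed. If n_past in the log marks the reuse point, that is 48112 - 35407 = 12705 tokens for this request.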

Now, I haven't dug into this too deeply yet, but I wanted to check: is this behavior expected? Do I have something configured wrong, or is Pi buggy in not sending thinking context from previous turns?

Here's the model config from my models.json in my Pi config:

    {
      "id": "unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL",
      "name": "Qwen3.5 122B-A10B (local)",
      "reasoning": true,
      "input": ["text", "image"],
      "contextWindow": 262144,
      "maxTokens": 65536,
      "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
      "compat": {
        "thinkingFormat": "qwen-chat-template"
      }
    },

u/CyberAttacked

4 days ago

Local (small) LLMs found the same vulnerabilities as Mythos

https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier

u/TimeEnvironmental219

19 hr. ago

MOSS-TTS-Nano: a 0.1B open-source multilingual TTS model that runs on 4-core CPU and supports realtime speech generation

We just open-sourced MOSS-TTS-Nano, a tiny multilingual speech generation model from and the OpenMOSS team.

Some highlights:

0.1B parameters
Realtime speech generation
Runs on CPU without requiring a GPU
Multilingual support (Chinese, English, Japanese, Korean, Arabic, and more)
Streaming inference
Long-text voice cloning
Simple local deployment with , , and CLI commands

The project is aimed at practical TTS deployment: small footprint, low latency, and easy local setup for demos, lightweight services, and product integration.

GitHub:

Huggingface:

Online demo:

Would love to hear feedback on quality, latency, and what use cases you’d want to try with a tiny open TTS model.

u/EntertainerFew2832

5 days ago

It finally happened, I actually had a use case for a local LLM and it was brilliant

I've had aerosinusitis a few times before in my life and it was fairly painful, but not something that happens often. Today on a flight I had an overwhelming bout of it, the pressure was genuinely unbearable, and I had no painkillers with me.

I was on a cheap flight, in the cheap seats so no Wifi.

I've been playing around with local LLMs on my laptop for a year or so, but it's always been pure novelty. It suddenly dawned on me that I could use Gemma 4 mid-air, and so I pulled out my laptop and asked for any way I could possibly reduce the pain.

The Toynbee Maneuver, which I had never in my life heard of, slowly but surely relieved the pressure. Within 10 mins I felt completely fine.

It may sound trivial, but without local AI I would have been in blinding pain for probably 90 mins – so it was a rare moment when new technology actually makes a palpable difference to your life.

Sharing this here because my wife didn't care and I felt if anyone would appreciate this small win it would be this community.

u/knlgeth

1 hr. ago

Made my messy notes actually usable

I used to spend way too much time trying to keep my notes clean across docs, PDFs, and random files… and it never really stayed organized anyway.

Recently tried just dumping everything into this repo: and letting it compile things into a wiki automatically.

Its core loop:

sources → compile → wiki → query → save → richer wiki

Now I barely organize anything myself, it just structures everything in a way that actually makes sense when I come back to it.

Give it a spin and let me know what you think:)

u/EvilEnginer

19 hr. ago

FernflowerAI-35B-A3B-KL-ReLU-GGUF + Apple MLX

Model available here:

Qwen 3.5 35B A3B Uncensored HauhauCS (repaired) -> (now with KL + ReLU calibration)

Repair summary:

Extra information about how Qwen 3.5 35B got broken (and how I fixed it):

V1 Apple MLX version (thanks to ):

V2 Apple MLX version (final release):

History:
Hello everyone. A few days ago I released a fixed version in which the two broken tensors that Alibaba shipped with the Qwen 3.5 35B A3B model (ssm_conv1d.weight in blocks 36-37) were scaled back to normal. They were broken by the heavy complexity of the training process and a bug in the AdamW optimizer. That fixed the major context collapse and looping. But after more testing, I found that some other tensors (experts, attention projections) had a subtler problem. Their overall scale and saturation looked fine, but the shape of their weight distribution was drifting away from the peer group. C1 and C2 didn't catch this. C3 (KL divergence) did.

So I added two more criteria to the diagnostic pass:

KL divergence - restores the distribution shape of tensors that drifted from their peer group without changing scale or saturation.
ReLU asymmetry - detects mean drift that AdamW can accumulate over time (didn't fire on this model, but the probe is there for others).

Results on this version:

Metric	Before	After
KL divergence (average)	0.1036	0.0297
KL reduction	—	71.3%
Repaired tensors (C2 + C3)	2	11

What this means for you:

The model was already stable after v1. Now it's tighter - fewer hidden distribution anomalies that could cause weird behavior on very long or complex tasks.
No new problems introduced. The 489 healthy tensors were left untouched.

Upgraded system prompt that unlocks deep thinking (works great with this model):

You can also use just this one line as the system prompt and add anything you want after it:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

Quantization script available here:

Updated chat template: (with tool fixes from and disabled thinking)

Recommended Settings (LM Studio):

Temperature	0.7
Top K Sampling	20
Presence Penalty	1.5
Repeat Penalty	Disabled or 1.0
Top P Sampling	0.8
Min P Sampling	0
Seed	3407

Enjoy ^_^

u/KvAk_AKPlaysYT

1 day ago

MiniMax M2.7 is NOT open source - DOA License :(

Commercial use is banned without prior written permission from MiniMax.

And their definition of "commercial" is broad - covers paid services, commercial APIs, and even deploying a fine-tuned version for profit. Military use is also explicitly prohibited- interesting.

So you can't use the model or any outputs for anything commercial!

I'm really starting to hate these "open weights, closed license" models...

u/leonardosalvatore

1 day ago

huge improvement after moving from ollama to llama.cpp

It's funny to set different parameters and watch it.
Code:

Those are tiny robots fighting each other to survive.
Between matches, only one class of robots is driven by code generated by qwen3 coder, and it does improve match after match...

u/Disastrous_Theme5906

8 days ago

Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run


u/ag789

11 hr. ago

Qwen 3.5 28B A3B REAP for coding initial impressions

the original models in hf are here:

unsloth contributed various quants

this is a follow up for

I'd guess, given the comments I've reviewed, that Qwen 3.5 (and Gemma 4) are deemed among the best models published for public consumption.

The models I tried on my plain old Haswell i7 CPU with 32 GB DRAM, all Q4_K_M quants:
unsloth/Qwen3.5-27B-GGUF 0.95 tokens / s
unsloth/Qwen3.5-35B-A3B-GGUF 4 tokens / s

barozp/Qwen-3.5-28B-A3B-REAP-GGUF 7.5 tokens / s

Tokens / s degrades as the context grows, e.g. when following up with prompts in the same context / thread. It can drop gradually from that 7.5 down to 1 tok/s.

What I used is Qwen-3.5-28B-A3B-REAP-GGUF, as it is 'small' enough to deliver a barely adequate throughput (7.5 t/s) on my hardware.

---
Initial impressions are that Qwen 3.5 tends to mention related concerns / references. And in llama.cpp, it does pretty verbose 'thinking' / planning steps before returning the actual response.

The mentions of related stuff make it a good documenter, and I actually tasked it to analyse the code of a shell script and prepare usage documentation for the shell script. It does this pretty well, in nicely formatted markdown text.

Code proposals are good (and some OK), but the most interesting thing I always try to get LLMs to do, and probably 'difficult' for these small LLMs, is to *refactor* code.

I asked it to refactor a shell script, fixing some bugs and adapting it to some structural changes in the data (e.g. the JSON format of the data), quite a complex task I'd think for such a 'small' LLM. It burned through > 10k tokens in the 'thinking' phase, but eventually did return refactored code. I'd guess that this LLM is kind of 'careful': I've seen it iterating over (the same) issues with 'wait ...', considering the dependencies / issues. The resulting code is 'not the best refactoring'; I'd guess it tried to follow the requirements of my prompt closely.
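To put that thinking budget in perspective, here is a rough back-of-the-envelope estimate using only the figures quoted in this post (about 10k thinking tokens, 7.5 t/s at best, about 1 t/s once the context is long). It ignores prompt processing time and assumes a constant speed, so treat it as a sketch rather than a measurement.

```python
def generation_minutes(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock minutes needed to generate `tokens` at a constant speed."""
    return tokens / tokens_per_second / 60


thinking_tokens = 10_000  # "> 10k tokens" from the post, used as a round figure

best_case = generation_minutes(thinking_tokens, 7.5)   # fresh, short context
worst_case = generation_minutes(thinking_tokens, 1.0)  # degraded, long context

print(f"best case:  {best_case:.0f} min")
print(f"worst case: {worst_case:.0f} min")
```

So even in the best case the thinking phase alone takes over 20 minutes on this hardware.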

Among the things I tried was a recursive proposal, i.e. refactor the JSON data structure, then refactor the shell script to handle the new data structure. It refactored the JSON data structure, but missed updating the shell script to work with the new structure. It took a second run, given the new data structure and the script, for the new structure to be handled.
In addition, if the prompt is 'too ambiguous', it can go in loops in the 'thinking' phase trying to resolve the ambiguity. When I see that in the 'thinking' phase, I tend to stop the inference and restructure my prompt so that it is more specific, and that helps to get to the solution.



u/cviperr33

6 days ago

Gemma 4 26b A3B is mindblowingly good , if configured right

I've switched to llama.cpp now. Read this post, it has some very valuable info if you want to run Gemma 4 as efficiently as possible.

For the last few days I've been trying different models and quants in LM Studio on my RTX 3090, but every single one always glitched on tool calling, with an infinite loop that doesn't stop. But I really liked the model because it is really fast, like 80-110 tokens a second, and even at high context it still maintains very high speeds.

I had great success with tool calling with the Qwen3.5 MoE model, but the issue I had with Qwen models is that there is some kind of bug in Win11 and LM Studio that makes prompt caching not work, so when the convo hits 30-40k context, it is so slow at processing prompts that it just kills my will to work with it.

Gemma 4 is different: it is much better supported on llama.cpp and the caching works flawlessly. I'm using flash attention + q4 quants, and with this I can push it to literally the maximum 260k context on an RTX 3090, and the model performs just as well.

I finally found the one that works for me: it's the unsloth Q3_K_M quant, temperature 1 and top k sampling 40. I have a custom system prompt that I'm using, which also might be helping.

I've been testing it with opencode for the last 6 hours and I just can't stop. It cannot fail: it explained to me the whole structure of OpenCode itself, and it is huge, like the whole repo is 2.7GB with so many lines of code, and it has no issues traversing around, reading everything and explaining how certain things work. I think I'm gonna create my own version of OpenCode in the end.

It honestly feels like Claude Sonnet level of quality and never fails to do function calling. I think this might be the best model for agentic coding / tool calling / OpenClaw or search engine.
I prefer it over Perplexity: in LM Studio, connected to a search engine via a plugin, it delivers much better results than Perplexity or Google.

As for VRAM consumption, it is heavy. It can probably work on 16GB, but not for tool calling or agents, where you need 10-15k context just to start. My GPU has 24GB of VRAM, so it can run it at full context with no issues on Q4_0 KV.

------------------------------- Quick update post -----------------------------------------------------------------

I'm running the IQ4_XS quant by unsloth now: full context size 260k, 94-102 tk/s, 20-21GB VRAM usage, Q4 KV cache.


u/dev_is_active

1 day ago

Weekend project with Intel B70s


u/0bjective-Guest

12 hr. ago

Should I Buy the RTX PRO 6000 Blackwell Max-Q (96GB)?

Mistral AI to release Voxtral TTS, a 3-billion-parameter text-to-speech model with open weights that the company says outperformed ElevenLabs Flash v2.5 in human preference tests. The model runs on about 3 GB of RAM, achieves 90-millisecond time-to-first-audio, supports nine languages.

I’m pretty new to the local AI world. So far, I’ve just been running small models on my mobile workstation (12GB VRAM) to help with my research in Obsidian and managing my Paperless-ngx setup. It’s been cool, but I definitely hit a wall when trying to run anything bigger or more "intelligent", though for my use case that is not really necessary (I also pay for Claude Pro, but usage limits have lately been horrendous; that's another topic).

I just stumbled across a deal on an NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB). It’s not significantly discounted (around 10% off), but I think the price is not bad (around 9700 USD).

I know these cards are rare and usually meant for big labs, but I’m tempted because I want to run the really powerful models (like the new Gemma 4 or DeepSeek) at home and access them from all my devices without relying on subscriptions.

My questions for the experts:

Is 96GB VRAM basically "endgame" for a single-user setup, or would I be better off with something cheaper?
Do people use such hardware for what I want to use it for (running powerful local LLMs), or rather for AI training or something else?
Would I have to build a custom PC to use it? How do I go from a GPU to actually using it?

I don't want to miss a rare price opportunity, but I also don't want to buy a piece of hardware I’ll never fully utilize. What would you do?

u/Nunki08

18 days ago


u/wossnameX

14 hr. ago

Intel NPU cannot run a LLM, can it?


I think so. And the ARC iFGX on many laptops is "good enough" for many use-cases.

I wrote code for a work project under GDPR; it worked well enough. 15,000 images compared overnight; it took about 7 hours.
Slow, but secure.

u/gigaflops_

6 days ago

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter


u/decrement--

1 day ago

Minimax M2.7 Released


https://huggingface.co/MiniMaxAI/MiniMax-M2.7

u/Own-Albatross868

13 hr. ago

FlashLM v8.3 (6.5M CORTEX) beats v5.2 Transformer baseline — same 2h CPU, same data


After iterating from v6 to v8.3, FlashLM v8.3 outperforms the Transformer baseline on TinyStories generation quality.

Both models trained under identical constraints:

Hardware: 2 vCPU / 5GB RAM (free-tier cloud CPU)
Time budget: 2 hours wall-clock
Dataset: TinyStories (same tokenizer, vocab 4096)
Training: from scratch, no pretraining, no distillation

The only variable is architecture.

Models Compared

Model	Architecture	Params	Training Tokens	PPL
v5.2 "Nova-Ignition"	Transformer + RoPE	5.0M	full 574M (0.027 epochs)	10.56
v8.3 "CORTEX-VIII"	SWA + Gated Delta Memory	6.5M	10M subset (1.5 epochs)	2.50

Note: v5.2 had to train on the full dataset because the 2h budget only allowed 0.027 epochs. v8.3's architecture efficiency allows 1.5 full epochs in the same time.
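Taking the epoch figures in the table at face value, the total number of tokens each run processed can be worked out as below. This is only arithmetic on the quoted numbers (the helper name is mine), not something taken from the training logs.

```python
FULL_DATASET_TOKENS = 574_000_000  # full TinyStories token count quoted in the table
SUBSET_TOKENS = 10_000_000         # subset used for v8.3


def tokens_seen(dataset_tokens: int, epochs: float) -> float:
    """Total tokens processed = dataset size * number of epochs."""
    return dataset_tokens * epochs


v52_tokens = tokens_seen(FULL_DATASET_TOKENS, 0.027)
v83_tokens = tokens_seen(SUBSET_TOKENS, 1.5)

print(f"v5.2: {v52_tokens / 1e6:.1f}M tokens, v8.3: {v83_tokens / 1e6:.1f}M tokens")
```

Both come out near 15M tokens: v5.2 sees each token at most once, while v8.3 revisits its 10M subset.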

Generation Samples

Same generation parameters for both models: temperature=1.2, top_k=40 (v5.2) / top_p=0.85 (v8.3), max_tokens=100.

Prompt: "Once upon a time"

v5.2 (Transformer)	v8.3 (CORTEX)
`Once upon a time on not pen cl nd grab wal . ily L , pl baby Sue dir , jump . aces park so luffy rec , igh made 's Lily star G began not gether ell G Tim ...`	`Once upon a time . sun like . helped look this !" began bed to . thought cake a and fish him Tom Mr Bunny fish . looked Ben place ! thinks book ?" butterfly the had and .`

Prompt: "The little girl"

v5.2 (Transformer)	v8.3 (CORTEX)
`The little girl ame <	making c tak . nd ould One very His iled ay asked etter eating . ily too ay star j , help were ra se star re ook nicer r big poin .`

Prompt: "One day a cat"

v5.2 (Transformer)	v8.3 (CORTEX)
`One day a cat B er fused . nd V rot his , en Spot re M mommy r c loud . day too ay came made ot ven . day ought un there , pl cry not gether ell cl special there wal er L , pl coffee , help not Dad after by ap mommy .`	`One day a cat . wanted and . laughed the but she . looked looked Tom the . lived in ! did do do , in said had ." girl her and tree pretty loved home school rest She She tea every .`

Observations

v5.2 (Transformer) produces random word fragments. It never forms a complete sentence. This is expected — 5M params and 0.027 epochs simply isn't enough for a Transformer to learn syntax.
v8.3 (CORTEX) shows clear syntactic structure. Subject-verb-object patterns appear (helped talk, wanted go, laughed the but she). Characters are named (Tom, Tim, Mr Bunny), actions are sequenced, and there's even a hint of emotion (loved home school rest).
The repetition problem is largely solved. v8.1 used to output Lily Lily Lily Lily endlessly. v8.3 occasionally repeats (play play, do do do) but recovers and continues.
PPL and generation quality are decoupled at this scale. v8.3's PPL (2.50) is worse than v7.4's (2.33), yet v8.3 generates much better text. Multiple epochs matter more than pure PPL for tiny models.
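For a sense of scale on those PPL numbers, they can be converted back to a per-token loss. This sketch assumes the usual definition PPL = exp(mean cross-entropy in nats); the post does not say which definition the training code uses.

```python
import math

# Perplexity values quoted in the post.
ppl = {"v5.2": 10.56, "v7.4": 2.33, "v8.3": 2.50}

# Assuming PPL = exp(mean cross-entropy in nats), invert it to get the loss.
loss = {name: math.log(p) for name, p in ppl.items()}

for name, value in loss.items():
    print(f"{name}: PPL {ppl[name]:.2f} -> loss {value:.3f} nats/token")
```

Under that assumption the v7.4 to v8.3 gap is only about 0.07 nats per token, small next to the gap to v5.2.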

What Changed from v8.1 to v8.3?

Subset training: 10M tokens instead of full 574M → 1.5 epochs in 2h (v8.1 only saw 0.027 epochs).
Entropy regularization in loss (weight=0.01) — prevents peaked distributions.
Zero weight decay on embedding/head — preserves low-frequency token distinctions.
SWA window reduced to 32, FFN kept at 512 — better throughput, same expressiveness.
Lookahead value heads down-weighted — they didn't help generation.
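The entropy regularization item could look roughly like the sketch below: plain cross-entropy with an entropy bonus, using the weight of 0.01 from the list. The sign convention (subtracting the entropy so that flatter predictions lower the loss) is my reading of "prevents peaked distributions" and is not confirmed by the FlashLM code. It is pure Python for a single token, not the actual training loop.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_regularized_loss(logits, target, weight=0.01):
    """Cross-entropy minus weight * entropy of the predicted distribution."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return cross_entropy - weight * entropy


# Both predictions pick the right token; the flatter one earns a larger
# entropy bonus, although its cross-entropy term is also higher.
peaked = entropy_regularized_loss([10.0, 0.0, 0.0, 0.0], target=0)
softer = entropy_regularized_loss([3.0, 0.0, 0.0, 0.0], target=0)
print(peaked, softer)
```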

Limitations (Honest)

Still not fluent. Sentences are broken, grammar is shaky. 6.5M parameters is below the "syntax threshold" for English (~10-20M).
TinyStories only. This isn't a general-purpose LLM.
v5.2 is 5M, v8.3 is 6.5M. The quality gap is too large to be explained by 1.5M extra params, but I'll be testing a 5M CORTEX variant to make the comparison perfectly matched.

Why This Matters

FlashLM's goal isn't to beat Llama-3. It's to find the highest possible intelligence density under extreme constraints.

CORTEX-VIII combines:

Sliding Window Attention (local, O(T))
Gated Delta Memory (global, linear recurrence)
Ternary-friendly design (though this run used float32 for speed)
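To illustrate why sliding window attention is linear in sequence length, here is a minimal sketch of the attention mask. It assumes causal masking (each position looks only at itself and earlier positions) and uses a tiny window for display; the post's run uses a window of 32. The function name is mine, not from the FlashLM code.

```python
def sliding_window_mask(seq_len: int, window: int):
    """mask[i][j] is True when position i may attend to position j (causal, local)."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Each row has at most `window` allowed positions, so the total work grows as
# seq_len * window instead of seq_len ** 2.
attended = sum(sum(row) for row in mask)
```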

At 6.5M params and 2h CPU training, a linear-complexity architecture is already beating a Transformer on generation quality. That's a small but real data point for the "efficient architecture" camp.

Code & Weights:

Questions welcome — happy to share training logs, hyperparameter sweeps, or failed experiments. The v6→v7 graveyard is especially educational.

u/Willing-Toe1942

14 hr. ago

LLM on the go - Testing 25 Model + 150 benchmarks for Asus ProArt Px13 - StrixHalo laptop


So I wanted a portable 13-inch laptop that can be a little LLM monster when needed. Asus did an amazing job with their new 2026 PX13 laptop, powered by a Strix Halo APU with 128GB of unified memory.

I made a benchmark automation system for the amazing toolboxes repo here:

This repo gives you multiple ready-to-use llama.cpp builds with ROCm and Vulkan.

My script sets the power profile to either power saving or high performance, then benchmarks all the provided GGUFs with llama-bench on 3 different llama.cpp backends (vulkan/rocm nightly/amdvlk).

The overall benchmark covers 25 models (ranging from 4B to 120B) with all the different backends and power profiles. This took almost 12 hours, with an average time of 4 to 5 minutes per run for each model at each configuration.
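The "150 benchmarks" and "almost 12 hours" figures line up if every model is run once per backend and power profile. A small sketch of that enumeration (backend names are the toolboxes listed in the run header; the one-run-per-combination layout is my assumption about how the script is organized):

```python
from itertools import product

NUM_MODELS = 25
backends = ["llama-vulkan-radv", "llama-vulkan-amdvlk", "llama-rocm7-nightlies"]
power_profiles = ["performance", "power-saver"]

# One benchmark run per (model, backend, power profile) combination.
runs = list(product(range(NUM_MODELS), backends, power_profiles))

low_hours = len(runs) * 4 / 60   # at 4 minutes per run
high_hours = len(runs) * 5 / 60  # at 5 minutes per run
print(f"{len(runs)} runs, roughly {low_hours:.1f} to {high_hours:.1f} hours")
```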

Side note: I tested multiple "heretic/hauhau versions" of the mainstream models because I found they are much more efficient in their thinking process, and I saw a little increase in their coding performance compared to the original ones (with some drop in translation tasks).

Here is the visualized leaderboard

For the power-saving profile I saw consumption near 40 watts, and for performance it varies from 60 to 77 watts.

------------

llama-bench ProArt PX13 HN7306EAC with strix halo toolboxes

Machine model: ProArt PX13 HN7306EAC
CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
Architecture: x86_64
Kernel: 7.0.0-rc7-2-cachyos-rc
OS: CachyOS n/a
OS Version: n/a
Toolboxes: ['llama-rocm7-nightlies', 'llama-vulkan-amdvlk', 'llama-vulkan-radv']
Mode: medium
Power Profiles: ['performance', 'power-saver']
Prompt tokens: 1024,4096,8192,16384
Generation tokens: 512,2048
Repetitions: 1

Leaderboard (sorted by Token Generation/Second)

Rank	Model	Best Gen Backend	Power Profile	Prompt/Gen Tokens (Gen)	Best Gen TPS	Best Prompt Backend	Prompt/Gen Tokens (Prompt)	Best Prompt TPS
1	Marco-Nano-Instruct.Q8_0.gguf	llama-vulkan-radv	Performance	512	211.325	llama-vulkan-radv	1024	4296.133
2	Marco-Mini-Instruct.Q8_0.gguf	llama-vulkan-radv	Performance	512	165.874	llama-vulkan-radv	1024	2329.999
3	OpenAI-20B-NEO-CODEPlus-Uncensored-IQ4_NL.gguf	llama-vulkan-radv	Performance	512	86.033	llama-rocm7-nightlies	1024	1347.876
4	gpt-oss-20b-Derestricted-MXFP4_MOE.gguf	llama-vulkan-radv	Performance	512	74.471	llama-rocm7-nightlies	1024	1317.919
5	gpt-oss-20b-heretic.MXFP4_MOE.gguf	llama-vulkan-radv	Performance	512	74.356	llama-vulkan-radv	1024	1323.742
6	Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf	llama-vulkan-amdvlk	Performance	512	69.059	llama-vulkan-radv	1024	917.500
7	Qwen3.5-35B-A3B-heretic.Q4_K_M.gguf	llama-vulkan-amdvlk	Performance	512	69.001	llama-vulkan-radv	1024	928.552
8	LFM2-24B-A2B-Q8_0.gguf	llama-vulkan-amdvlk	Power Saver	512	60.739	llama-rocm7-nightlies	1024	1456.713
9	Qwen3.5-35B-A3B-Q4_K_M.gguf	llama-vulkan-amdvlk	Power Saver	512	59.614	llama-rocm7-nightlies	1024	911.428
10	Qwen3.5-4B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf	llama-vulkan-radv	Performance	512	59.263	llama-vulkan-radv	1024	1716.063
11	Qwen3.5-4B-UD-Q4_K_XL-unsloth-v2.gguf	llama-vulkan-radv	Performance	512	56.642	llama-vulkan-radv	4096	1600.179
12	gemma-4-26B-A4B-it-UD-Q3_K_M.gguf	llama-vulkan-radv	Performance	512	55.191	llama-rocm7-nightlies	1024	1044.901
13	gemma-4-26B-A4B-it-UD-IQ4_XS.gguf	llama-vulkan-radv	Performance	512	52.416	llama-rocm7-nightlies	1024	1510.919
14	bartwoski_Qwen3.5-35B-A3B-Q4_K_M.gguf	llama-vulkan-amdvlk	Power Saver	512	51.307	llama-rocm7-nightlies	1024	783.849
15	gemma-4-26B-A4B-it-UD-Q4_K_XL (1).gguf	llama-vulkan-radv	Performance	512	49.469	llama-rocm7-nightlies	1024	1620.560
16	Qwen3-Coder-Next-UD-IQ1_M.gguf	llama-vulkan-radv	Power Saver	512	48.834	llama-vulkan-radv	1024	472.070
17	Qwen3.5-35B-A3B-UD-Q4_K_XL-unsloth-v2.gguf	llama-vulkan-amdvlk	Power Saver	512	46.992	llama-rocm7-nightlies	1024	1009.841
18	bartwoski_Qwen3-Coder-Next-IQ4_XS.gguf	llama-vulkan-radv	Power Saver	512	41.375	llama-vulkan-radv	1024	615.839
19	kldzj_gpt-oss-120b-heretic-v2-MXFP4_MOE-00001-of-00002.gguf	llama-rocm7-nightlies	Power Saver	512	40.004	llama-vulkan-radv	1024	432.180
20	Qwen_Qwen3-Coder-Next-IQ4_XS.gguf	llama-vulkan-radv	Power Saver	0/2048	39.801	llama-vulkan-radv	1024	621.813
21	Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf	llama-vulkan-radv	Performance	512	36.393	llama-rocm7-nightlies	1024	953.875
22	Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive-IQ3_XXS.gguf	llama-vulkan-radv	Power Saver	512	27.562	llama-rocm7-nightlies	1024	186.736
23	omnicoder-2-9b-q8_0.gguf	llama-vulkan-radv	Performance	512	23.944	llama-rocm7-nightlies	1024	986.071
24	bartwoski_Qwen3.5-122B-A10B-IQ3_XXS-00001-of-00002.gguf	llama-vulkan-radv	Power Saver	512	23.206	llama-rocm7-nightlies	1024	234.785
25	unsloth-Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf	llama-vulkan-radv	Power Saver	512	20.771	llama-rocm7-nightlies	1024	194.398

Leaderboard (sorted by Prompt Processing T/Second)

Rank	Model	Best Gen Backend	Power Profile	Prompt/Gen Tokens (Gen)	Best Gen TPS	Best Prompt Backend	Prompt/Gen Tokens (Prompt)	Best Prompt TPS
1	Marco-Nano-Instruct.Q8_0.gguf	llama-vulkan-radv	Performance	512	211.325	llama-vulkan-radv	1024	4296.133
2	Marco-Mini-Instruct.Q8_0.gguf	llama-vulkan-radv	Performance	512	165.874	llama-vulkan-radv	1024	2329.999
3	Qwen3.5-4B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf	llama-vulkan-radv	Performance	512	59.263	llama-vulkan-radv	1024	1716.063
4	gemma-4-26B-A4B-it-UD-Q4_K_XL (1).gguf	llama-vulkan-radv	Performance	512	49.469	llama-rocm7-nightlies	1024	1620.560
5	Qwen3.5-4B-UD-Q4_K_XL-unsloth-v2.gguf	llama-vulkan-radv	Performance	512	56.642	llama-vulkan-radv	4096	1600.179
6	gemma-4-26B-A4B-it-UD-IQ4_XS.gguf	llama-vulkan-radv	Performance	512	52.416	llama-rocm7-nightlies	1024	1510.919
7	LFM2-24B-A2B-Q8_0.gguf	llama-vulkan-amdvlk	Power Saver	512	60.739	llama-rocm7-nightlies	1024	1456.713
8	OpenAI-20B-NEO-CODEPlus-Uncensored-IQ4_NL.gguf	llama-vulkan-radv	Performance	512	86.033	llama-rocm7-nightlies	1024	1347.876
9	gpt-oss-20b-heretic.MXFP4_MOE.gguf	llama-vulkan-radv	Performance	512	74.356	llama-vulkan-radv	1024	1323.742
10	gpt-oss-20b-Derestricted-MXFP4_MOE.gguf	llama-vulkan-radv	Performance	512	74.471	llama-rocm7-nightlies	1024	1317.919
11	gemma-4-26B-A4B-it-UD-Q3_K_M.gguf	llama-vulkan-radv	Performance	512	55.191	llama-rocm7-nightlies	1024	1044.901
12	Qwen3.5-35B-A3B-UD-Q4_K_XL-unsloth-v2.gguf	llama-vulkan-amdvlk	Power Saver	512	46.992	llama-rocm7-nightlies	1024	1009.841
13	omnicoder-2-9b-q8_0.gguf	llama-vulkan-radv	Performance	512	23.944	llama-rocm7-nightlies	1024	986.071
14	Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf	llama-vulkan-radv	Performance	512	36.393	llama-rocm7-nightlies	1024	953.875
15	Qwen3.5-35B-A3B-heretic.Q4_K_M.gguf	llama-vulkan-amdvlk	Performance	512	69.001	llama-vulkan-radv	1024	928.552
16	Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf	llama-vulkan-amdvlk	Performance	512	69.059	llama-vulkan-radv	1024	917.500
17	Qwen3.5-35B-A3B-Q4_K_M.gguf	llama-vulkan-amdvlk	Power Saver	512	59.614	llama-rocm7-nightlies	1024	911.428
18	bartwoski_Qwen3.5-35B-A3B-Q4_K_M.gguf	llama-vulkan-amdvlk	Power Saver	512	51.307	llama-rocm7-nightlies	1024	783.849
19	Qwen_Qwen3-Coder-Next-IQ4_XS.gguf	llama-vulkan-radv	Power Saver	0/2048	39.801	llama-vulkan-radv	1024	621.813
20	bartwoski_Qwen3-Coder-Next-IQ4_XS.gguf	llama-vulkan-radv	Power Saver	512	41.375	llama-vulkan-radv	1024	615.839
21	Qwen3-Coder-Next-UD-IQ1_M.gguf	llama-vulkan-radv	Power Saver	512	48.834	llama-vulkan-radv	1024	472.070
22	kldzj_gpt-oss-120b-heretic-v2-MXFP4_MOE-00001-of-00002.gguf	llama-rocm7-nightlies	Power Saver	512	40.004	llama-vulkan-radv	1024	432.180
23	bartwoski_Qwen3.5-122B-A10B-IQ3_XXS-00001-of-00002.gguf	llama-vulkan-radv	Power Saver	512	23.206	llama-rocm7-nightlies	1024	234.785
24	unsloth-Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf	llama-vulkan-radv	Power Saver	512	20.771	llama-rocm7-nightlies	1024	194.398
25	Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive-IQ3_XXS.gguf	llama-vulkan-radv	Power Saver	512	27.562	llama-rocm7-nightlies	1024	186.736

Here are more detailed tables with the exact context length for each run

u/amithatprogrammer

13 hr. ago

turning my phone into a local AI server (open source project update)


I made an app, A.I.R.I, that runs LLMs locally on your phone. I’ve made a pretty big upgrade since its initial release and it’s starting to feel like something more than just a chat app.

The main idea now is: your phone = a personal AI server

It can:

run models locally
be accessed by other devices on your Wi-Fi
support voice conversations (TTS + STT)
handle documents with a simple RAG pipeline
manage and download models inside the app
keep chat history + user profiles for context

I also completely refactored the architecture so it’s modular and easier to extend (which was badly needed).

Still a work in progress, but this is the first time it feels like the original idea is actually working. Repo:


u/DarthLoki79

12 hr. ago

Gemma 4 26B on oMLX with OpenCode, M4 Max, 64GB unified - am I doing something wrong/miscalibrated on capabilities here?


So this might very well be user error on my end but please let me know if whatever I am doing is somehow wrong:

M4 Max (highest core count version), 64GB of unified memory
Using oMLX 0.3.5dev1 version for serving, gemma 4bit it 26-a4b (200k context)
Opencode harness for running the model - no custom instructions for now

Consistently I see the LLM not doing what it is told to do. For example - I have some here:

I don't see it thinking all the time. I have it set to the "high" variant in opencode, which sets the thinkingBudget to 8092 tokens, and have "forced" it to think within oMLX via the chat template and thinking budget, but it does not always think. For some reason, it also stops after saying it will do a certain tool call without actually making it. I don't know if this is a result of the qwen reasoning parser that I'm using or not. If anyone is using oMLX, let me know what reasoning_parser you are using.
Another random question: I'm seeing a lot of people run this on the same hardware as mine with much higher token generation speeds, but they are using less context (I'm using 200k). Is that the reason, or am I doing something else wrong here?
It goes into repetition loops. I am using the default repetition penalty, but sometimes it's just bad (this was with oMLX v0.3.3, so maybe this has been patched since). Screenshot for this also attached:

So this has been my experience. Let me know if I'm doing anything obviously wrong, or whether this is a case where I simply have to tone down my expectations. I know I can't have SOTA-like expectations for a model of this size, but idk if I'm miscalibrated or not. Because of all the hype around this Gemma 4 release, I thought it would be able to call tools reliably compared with my experience with some older models (GPT-OSS 20B/Qwen 3 Next/Qwen 3 Coder models; the GPT 20B version used to say "I'll call the tool" and would just stop, while the Qwen models were better).

So not sure whether this is a calibration problem/I don't have a proper system prompt that works well with this model on opencode/I have some settings that are wrong.

u/Careful_Equal8851

24 days ago

Ooh, new drama just dropped 👀


u/Adorable_Weakness_39

18 hr. ago

Considering ditching Claude/Codex completely


They have become completely unusable over the past few days.

A few things I have noticed:

- Codex has cut its 5-hour session cap massively so now you can barely tell it to program fizz buzz before running out of tokens.

- Claude Code has the same problem.

They have both just massively dropped in intelligence as well. I have heard people on X talking about how Anthropic models are being throttled in terms of intelligence (for non API tokens). I have had the same problem with GPT-5.4 where it just refuses to do stuff and has a bias to not take actions even if explicitly stated (which I've heard is a byproduct of limiting reasoning tokens).

This causes people to have to send more messages which then uses even more input & output tokens.

Might take the open-source pill. Perhaps Qwen3.5 27B locally, and GLM5.1 on the cloud.

u/Nunki08

10 days ago

Netflix just dropped their first public model on Hugging Face: VOID: Video Object and Interaction Deletion


u/Money-Information

2 hr. ago

Built an OSS tool that uses local LLMs to generate codebase cliff notes, code tours, and architecture analysis from any Git repo

6 days ago

https://huggingface.co/zai-org/GLM-5.1


3 days ago

Final voting results for Qwen 3.6


u/DeltaSqueezer

21 hr. ago

How do you stop codebase from degenerating into an un-maintainable AI-slop mess?


What techniques help to reap the benefits of AI code without it accumulating into massive technical debt requiring costly re-writes?

u/Auralore

3 days ago

GLM 5.1 tops the code arena rankings for open models


u/Wonderful-Ad-5952

4 days ago

Opus = 0.5T × 10 = ~5T parameters ?



u/KvAk_AKPlaysYT

1 day ago

MiniMaxAI/MiniMax-M2.7 is here!


https://huggingface.co/MiniMaxAI/MiniMax-M2.7


u/life_coaches

1 day ago

Did I just destroy a brand new motherboard?


u/Forward_Compute001

2 hr. ago

Where to get professional help for vibecoding


I'm thinking about vibecoding the next part of my project, but I am fairly confident I will need someone with a lot more experience, a person or a company that can help me figure everything out so I can learn this fast, or assist me.

The scope and amount of code is relatively small and not complicated; it's mostly small snippets. (I think I can provide the precise architecture.)

Has someone any idea where I can find assistance?

u/Electrical-Monitor27

6 days ago

Turns out Gemma 4 had MTP (multi token prediction) all along


u/ilintar

4 days ago

Gemma 4 on Llama.cpp should be stable now


With the merging of , all of the known Gemma 4 issues in Llama.cpp have been resolved. I've been running Gemma 4 31B on Q5 quants for some time now with no issues.

Runtime hints:

remember to run with `--chat-template-file` with the interleaved template Aldehir has prepared (it's in the llama.cpp code under models/templates)
I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
running KV cache with Q5 K and Q4 V has shown no large performance degradation, of course YMMV
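Put together, the first two hints might translate into a launch command like the one assembled below. This is a sketch, not the author's exact invocation: both paths are hypothetical placeholders (the template filename is deliberately left unfilled), and only flags named in the hints plus `-m` are used. The KV cache quantization flags are left out because the exact cache types are not spelled out above.

```python
import shlex

# Hypothetical paths; substitute your own model file and the interleaved
# template that ships in the llama.cpp source tree under models/templates.
model_path = "/path/to/gemma-4-31B.gguf"
template_path = "/path/to/llama.cpp/models/templates/<interleaved-template>.jinja"

command = [
    "llama-server",
    "-m", model_path,
    "--chat-template-file", template_path,  # interleaved thinking template
    "--cache-ram", "2048",                  # value suggested in the hints
    "-ctxcp", "2",                          # value suggested in the hints
]

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(command))
```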

Have fun :)

(oh yeah, important remark - when I talk about llama.cpp here, I mean the *source code*, not the releases which lag behind - this refers to the code built from current master)

Important note about building: DO NOT currently use CUDA 13.2 as it is CONFIRMED BROKEN (the NVidia people are on the case already) and will generate builds that will not work correctly.

u/EducationalImage386

1 day ago

Meta released new paper : Neural Computers


What they wish to convey is: can AI act like a computer? The team tried training a video model to generate simulations of terminal and desktop sessions and got decent results. Check more details:

paper :

u/Key-Currency1242

2 days ago

The tried to make me go to rehab. I said no no no…


u/Kuuga2411

2 hr. ago

Question regarding Arc-AGI-3 tests.


u/shreyansh26

17 hr. ago

Educational PyTorch repo for distributed training from scratch: DP, FSDP, TP, FSDP+TP, and PP


I put together a small educational repo that implements distributed training parallelism from scratch in PyTorch:

Instead of using high-level abstractions, the code writes the forward/backward logic and collectives explicitly so you can see the algorithm directly.

The model is intentionally just repeated 2-matmul MLP blocks on a synthetic task, so the communication patterns are the main thing being studied.

Built this mainly for people who want to map the math of distributed training to runnable code without digging through a large framework.

Based on

u/MagicZhang

13 days ago

Just a helpful open-source contributor



u/mooncatx3

20 days ago

LM Studio may possibly be infected with sophisticated malware.


u/No-Key8555

1 day ago

Is it normal for Gemma 4 26B/31B to run this fast on an Intel laptop? (288V / CachyOS)


5 days ago

It looks like we’ll need to download the new Gemma 4 GGUFs


by :

We just updated them again in response to:

kv-cache : support attention rotation for heterogeneous iSWA
CUDA: check for buffer overlap before fusing - CRITICAL fixes <unused24> tokens
vocab : add byte token handling to BPE detokenizer for Gemma4
convert : set "add bos" == True for Gemma 4
common : add gemma 4 specialized parser
llama-model: read final_logit_softcapping for Gemma 4
llama: add custom newline split for Gemma 4

u/ezyz

1 day ago

A Mac Studio for Local AI — 6 Months Later


https://spicyneuron.substack.com/p/a-mac-studio-for-local-ai-6-months

u/No-Anchovies

1 day ago

If you haven't yet given Gemma 4 a go...do it today


I have a modest rig that allows me to run Qwen 3.5 27B or even 35B via Ollama. Qwen has been amazing to work with and I've been fine with the slow drip trade-off.

Then Google released Gemma4.

It's fast - like 4 or 9B fast. Accuracy and confidence wise, it reminds me of that first release of Gemini Pro that could actually produce code that would run.

As a "local guy", this shift in usability and confidence for a small self-hosted LLM reminded me of what Deepseek brought to the table years ago with the thinking capability.

Give it a go when you have a chance, and apply the settings that google recommends, it does make a difference (slightly slower but better)

I tried a few releases and this one worked the best for all the tests I threw at it with law interpretation, python, brainstorming & problem solving.

bjoernb/gemma4-26b-fast:latest (not affiliated with whoever made this)

in the next few days I'll start checking the abliterated versions to see how they stand with pentest & sysec tasks vs Qwen

u/Ryoiki-Tokuiten

5 days ago

Gemma4-31B worked in an iterative-correction loop (with a long-term memory bank) for 2 hours to solve a problem that baseline GPT-5.4-Pro couldn't

u/Asleep_Training3543

1 day ago

MiniMax-M2.7 GGUF Quants — Full Set (Q2_K to Q8_0 + BF16)

Just finished quantizing MiniMax-M2.7 to GGUF. All standard quant levels available:

- BF16 (~427 GB)

- Q8_0 (~243 GB)

- Q6_K (~188 GB)

- Q5_K_M (~162 GB)

- Q4_K_M (~138 GB)

- Q3_K_M (~109 GB)

- Q2_K (~83 GB)
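
For a rough sense of what these sizes imply, here is a minimal Python sketch that converts the listed file sizes into an effective bits-per-weight figure and picks the largest quant that fits a memory budget. It assumes the BF16 file stores about 16 bits per weight and ignores file metadata, KV cache and context overhead. The function names and the example budgets are made up for illustration and are not part of the release.

```python
# Approximate file sizes in GB, copied from the list above.
SIZES_GB = {
    "BF16": 427,
    "Q8_0": 243,
    "Q6_K": 188,
    "Q5_K_M": 162,
    "Q4_K_M": 138,
    "Q3_K_M": 109,
    "Q2_K": 83,
}


def effective_bits_per_weight(size_gb, bf16_size_gb=SIZES_GB["BF16"]):
    """Estimate average bits per weight, assuming BF16 is 16 bits per weight."""
    return 16.0 * size_gb / bf16_size_gb


def largest_quant_that_fits(memory_gb, headroom_gb=0.0):
    """Return the biggest quant whose file fits in memory_gb minus headroom_gb.

    headroom_gb is a hypothetical allowance for KV cache and runtime overhead.
    Returns None when nothing fits.
    """
    budget = memory_gb - headroom_gb
    candidates = [(size, name) for name, size in SIZES_GB.items() if size <= budget]
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    for name, size in SIZES_GB.items():
        print(f"{name:7s} ~{size:4d} GB  ~{effective_bits_per_weight(size):.2f} bits/weight")
    print("Largest quant for a 192 GB budget:", largest_quant_that_fits(192))
```

The bits-per-weight numbers are only estimates derived from file size, so they will not exactly match the nominal bit width of each quant type.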

u/Nicking0413

1 day ago

Help my llm isn't llming

u/TKGaming_11

22 days ago

Alibaba confirms they are committed to continuously open-sourcing new Qwen and Wan models

u/Decivox

1 day ago

Qwen 3.5 "Weight Drift" Fix? Automated Tool + Inconclusive NIAH Results

https://github.com/decibuild/qwen-ssm-repair

u/styles01

3 hr. ago

What are people's fave local model setups for home?

After much much much testing of various models for: Openclaw, Hermes, Claude Code, and 'random creative requests' - here is my currently working setup.

For Claude Code/Openclaw.

I use AIRun to override Claude's model to Ollama, using GLM 5.1:cloud - I find this to be the best. Openclaw defaults to the same. It's a bit slow, but way more reliable than Minimax - I find Minimax is way more likely to be a cowboy and do stuff you didn't ask or want it to do.
Local big model: Gemma4-26B-q4 - this thing is amazing. Performance is through the roof locally on an M4 Max, and it doesn't use up a zillion tokens on reasoning like Qwen does. Great for coding and reasoning locally. This is my local workhorse now.
Creative tasks: joke-of-the-day, basic writing stuff - Llama 3.2 3B - tiny, fast as f*** and does a great job at basic stuff. I find it to be the most creative and human of the models I've tested for creative writing.

I tried Qwen over and over but just had tons of issues, especially with too much reasoning (couldn't tweak it to low or medium) and just general performance.

Interested to hear your experiences.

u/HornyGooner4401

13 days ago

How it started vs How it's going

u/Imakerocketengine

20 hr. ago

[Release] Carnice-9b-W8A16-AWQ – AWQ Quantization Optimized for vLLM + Marlin on Ampere GPUs (Single-GPU)

Hey ,

I am releasing my first model quantization: an 8-bit symmetric AWQ (W8A16) of kai-os/Carnice-9b, specifically optimized for Ampere GPUs (RTX 30-series) using vLLM with the Marlin kernel on a single-GPU inference setup.

kai-os/Carnice-9b is a specialized fine-tune of Qwen/Qwen3.5-9B that removes the visual components and adopts the Qwen3_5ForCausalLM architecture for pure text/agentic use (Hermes Agent harness). This architecture is not yet natively supported by vLLM (pending PR #39316).

To enable seamless loading, the quantized checkpoint re-wraps the weights into the Qwen3_5ForConditionalGeneration architecture (matching the original Qwen/Qwen3.5-9B configuration). This allows vLLM to serve it correctly with the --language-model-only flag for text-only inference.

Model:

Benchmark highlights (vLLM bench on random dataset, single RTX 3090 + Marlin):
• Average prompt throughput: ~1,994 tokens/s
• Average generation throughput: ~222 tokens/s

I'm gonna run some benchmarks specific to the Hermes agent environment (Terminal Bench Lite and YC bench). From a quick vibe check, it seems pretty good.

Quick vLLM usage (single GPU):

vllm serve TurbulenceDeterministe/Carnice-9b-W8A16-AWQ \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --language-model-only \
  --tensor-parallel-size 1

I would greatly appreciate your feedback on how to improve future quantizations. Thank you!

u/Merchant_Lawrence

4 hr. ago

Llamacpp on chromebook 4 gb ram

Generation

u/ISeeThings404

1 day ago

Improving Language Models through Latent Reasoning?

Found this tweet online and wanted to see if anyone here had any opinions on it.

I'm an AI researcher and have been exploring latent space reasoning for a bit (since mid-2024; I really got into it when Meta published Coconut). This would check out in a few ways:

The performance mentioned here.
The order-of-magnitude reduction when comparing Mythos and Opus 4.6 for BrowseComp.
General discussions from researchers in the space.

I've personally done some research into it, and I think it will be the future of AI and reasoning models. Too many reasons for it not to be (especially if we create a unified reasoning plane that models can plug in and out of). Wanted to get your thoughts on it, especially if anyone else has tried it.

Did a bunch of experiments on it here, in case anyone is interested (would love to hear your experiences with it as well):

u/ss2642

19 hr. ago

Llama with FlexAttention

Hi everyone,

I am new to this community, this is my first blog post here (forgive if there are any mistakes).

I recently came across this blog post on the PyTorch website. My understanding of what it does (please correct me if I am wrong): it generates custom Triton kernels for various attention implementations (a kind of compiler for attention), which helps save memory and latency during the scaled dot product attention computation, as this heavy work can be smartly offloaded to the GPU.

I found it very interesting and would like to use it in one of my projects. For this I need to integrate it into an actual LLM (say Llama 3/3.1/3.2). Since it provides only the attention computation, how can I integrate it with the weights of an actual LLM? Almost all the tutorials I saw for FlexAttention generate random Q, K and V matrices for demonstration.

There is also an option of using something like `attn_implementation=flex_attention`, but then how do I use the `score_mod` and `mask_mod` attributes?

Is there some documentation, or a git repo doing this? Any guidance on how to approach this would help.
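
To make the `score_mod` and `mask_mod` question concrete, here is a minimal pure-Python sketch of what the two callbacks do to attention scores. It is not the PyTorch implementation and does not load any Llama weights; it only mirrors the callback signatures as I understand them from the FlexAttention docs, `score_mod(score, b, h, q_idx, kv_idx)` and `mask_mod(b, h, q_idx, kv_idx)`. The names `reference_attention`, `causal_mask` and `distance_penalty` are made up for illustration, and batch and head indices are fixed to 0.

```python
import math


def reference_attention(q, k, v, score_mod=None, mask_mod=None):
    """Single-head attention over lists of vectors (batch index and head index are 0).

    score_mod rewrites each scaled dot-product score before softmax.
    mask_mod returns True for (query, key) pairs that are allowed to attend.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_idx, q_vec in enumerate(q):
        scores = []
        for kv_idx, k_vec in enumerate(k):
            s = scale * sum(a * b for a, b in zip(q_vec, k_vec))
            if score_mod is not None:
                s = score_mod(s, 0, 0, q_idx, kv_idx)
            if mask_mod is not None and not mask_mod(0, 0, q_idx, kv_idx):
                s = -math.inf  # masked pairs get zero weight after softmax
            scores.append(s)
        top = max(scores)
        if top == -math.inf:
            # Every key is masked for this query: emit zeros instead of NaN.
            out.append([0.0] * len(v[0]))
            continue
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_vec[d] for w, v_vec in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def causal_mask(b, h, q_idx, kv_idx):
    """A query may only attend to keys at the same or earlier positions."""
    return q_idx >= kv_idx


def distance_penalty(score, b, h, q_idx, kv_idx):
    """Hypothetical bias: subtract 0.1 per position of distance."""
    return score - 0.1 * abs(q_idx - kv_idx)


if __name__ == "__main__":
    q = k = v = [[1.0, 0.0], [0.0, 1.0]]
    print(reference_attention(q, k, v, score_mod=distance_penalty, mask_mod=causal_mask))
```

In PyTorch itself, my understanding is that `mask_mod` is not passed to the attention call directly but is first turned into a block mask with `create_block_mask`, while `score_mod` is passed as a keyword argument to `flex_attention`. How a given model wrapper exposes those hooks under `attn_implementation` is something to check in that library's docs.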

u/True_Requirement_891

23 hr. ago

Is it just me or minimax-m2.7 is a regression in real world usage compared to minimax-2.5???

I have been using minimax-m2.7 and minimax-m2.5 through the official API in Claude Code since their first day of release, and minimax-m2.5 always seems to complete tasks and figure things out faster than 2.7.

Minimax-m2.7 hallucinates too much, and I haven't seen any improvement in real-world usage in literally any task, but I have noticed regression.
In terms of reliability, 2.5 > 2.7.

I have no idea why this is the case when it performs better on all benchmarks...

u/TankFirm388

4 hr. ago

Component Purgatory: 5090 to 6000 Pro Blackwell Upgrade Path Questions

I've been using a 5090 build as a hybrid PC (80% local LLM, 20% gaming). It is essentially a near-maxed-out consumer setup (9950X3D, 128GB RAM).

I've recently decided to commit more to building some LLM workflows for my partner's local business (plus some other local colleagues) and have a new 6000 Pro Max-Q coming soon to expand to larger models w/ larger context (was able to get good business pricing + NVIDIA Inception discount).

I'm inclined to just add it to my current setup to upgrade the 'core' LLM portion of my usage. I'd keep the 5090 as a dev GPU for testing out new models and/or learning multi-model workflows, plus gaming. My only concern is that keeping the 5090 attached will handicap the 6000 by cutting its share of my mobo's PCIe bandwidth in half (x8/x8 vs x16).
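
To put rough numbers on the x8 vs x16 concern, here is a small Python sketch of theoretical PCIe link bandwidth. It assumes both slots run at PCIe 5.0, which the post does not state, and it uses peak one-direction figures with 128b/130b encoding, so real transfers will be slower. The 60 GB payload is a made-up example.

```python
# Raw signalling rate per lane in GT/s for recent PCIe generations.
GIGATRANSFERS_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}

# PCIe 3.0 and later use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gb_s(generation, lanes):
    """Theoretical one-direction bandwidth in GB/s for a link of the given width."""
    gbit_per_lane = GIGATRANSFERS_PER_LANE[generation] * ENCODING_EFFICIENCY
    return gbit_per_lane / 8 * lanes


def transfer_seconds(size_gb, generation, lanes):
    """Best-case time to move size_gb across the link, ignoring all overhead."""
    return size_gb / pcie_bandwidth_gb_s(generation, lanes)


if __name__ == "__main__":
    payload_gb = 60  # hypothetical model size being loaded onto the GPU
    for lanes in (16, 8):
        bandwidth = pcie_bandwidth_gb_s(5, lanes)
        seconds = transfer_seconds(payload_gb, 5, lanes)
        print(f"PCIe 5.0 x{lanes}: ~{bandwidth:.1f} GB/s, {payload_gb} GB in ~{seconds:.1f} s")
```

Halving the lanes halves the peak link rate, which mostly matters when data crosses the bus (model loading, offloading to system RAM, GPU-to-GPU traffic) rather than when a model sits entirely in one card's VRAM.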

I've also been tempted to just sell the 5090 and get another 6000, but that seems to overshadow the rest of the machine (I would likely want 256GB RAM, plus the same PCIe conundrum).

I do like the hybrid-ness of the current setup and the potential of a 6000/5090, since it shares costs across multiple budgets (gaming, hobby/learning, business), but it feels like I'm reaching the point where those activities start to interfere with each other.

Does anyone have a similar build and like it? Is this a dumb 'trying to do everything' machine that I should commit one way or another on? At what level does a machine have to move on from consumer components?

u/Total-Resort-3120

6 days ago

DFlash: Block Diffusion for Flash Speculative Decoding.

u/Sudden_Vegetable6844

19 hr. ago

4B models on smartphone

Are local 4B models usable on smartphone?

Just did a vibe check on a Pixel 10 Pro, Gemma 4B vs Qwen 3.5 4B, starting from handheld photos of ninth grade STEM tests (written in French; I asked in English, and both models replied in English).

Gemma 4 E4B via Google AI core runs on the NPU: quite fast and energy efficient, but it hallucinated about half the text from the image and failed. When the tests were manually entered as text, it got most of them right.

Qwen 3.5 4B Q4_K_M via PocketPal (llama.cpp under the hood) not only got all the text right, it also passed all the tests without errors. But the phone got very hot, and then it would slow down to a crawl after a couple hundred tokens (it would regain speed when allowed to cool down, even on long context).

Interestingly enough, the Qwen model is slightly smaller (3.4GB vs 3.6GB). If it got NPU support and basic tools, I suspect it could cover everyday AI needs locally...

u/LegacyRemaster

7 days ago

Minimax 2.7: good news!

u/pragmojo

19 hr. ago

Best model for translation between languages?