Honeymoon is over, baby (Codex use limits sharply cut)

SuspciousCarrot78@lemmy.world · 9 hours ago

You and everyone else :)

I asked Claude btw. Here’s what the Frenchman thinks (yes, I know, Claude Shannon was not French)

Great question — Gemma 4 is a genuinely impressive release. Let me break it down by model with a coding-specific score relative to your Opus 4.6 = 10/10 benchmark.

The lineup — Gemma 4 comes in four sizes: E2B (2.3B effective parameters), E4B (4.5B effective), a 31B dense model, and a 26B MoE with only 4B active parameters. All are Apache 2.0 licensed and run locally.

Coding benchmarks (the numbers that matter here):

On LiveCodeBench v6, the 31B scores 80.0%, the 26B MoE 77.1%, the E4B 52.0%, and the E2B 44.0% — compared to just 29.1% for the previous Gemma 3 27B. The Codeforces ELO tells an even sharper story: the 31B reaches 2150 ELO, the 26B MoE 1718, the E4B 940, and the E2B just 633.

My scores vs. Opus 4.6 = 10/10 for local coding:

Gemma 4 31B — ~7.5/10 This is the headline act. An 80% LiveCodeBench and Codeforces ELO of 2150 is elite for an open-weights model. The catch: at 31B dense, you’ll need ~20GB VRAM quantized (Q4), so it needs a beefy local setup (e.g. a 24GB GPU or Apple Silicon M2 Ultra+). If your hardware can handle it, this genuinely punches into serious territory.

Gemma 4 26B A4B (MoE) — ~7/10 The most exciting model for local use IMO. It reaches an estimated LMArena score of 1441 with just 4B active parameters, meaning it runs fast with the memory footprint of a ~4B model while delivering near-31B quality. On a 16GB Mac or a mid-range GPU, this is a remarkable coding companion. The 77.1% LiveCodeBench is only fractionally behind the 31B.

Gemma 4 E4B — ~5/10 A solid coder for something running in ~6GB RAM. 52% LiveCodeBench won’t blow your mind on hard problems, but for autocomplete, boilerplate, explanations, and routine scripts it’s genuinely useful. The E4B also supports audio input alongside image and text, making it unusually versatile for its size.

Gemma 4 E2B — ~3.5/10 Impressive that it works at all at ~2.3B effective parameters, but the 44% LiveCodeBench and ELO of 633 show real limits on complex coding tasks. Fine for quick snippets or code explanation on very constrained hardware (Raspberry Pi, etc.), but not a serious daily driver for development.

The big picture: The 26B MoE is probably the most exciting model in the lineup for most people — it’s the one that truly shifts what’s possible locally. It runs with first-class support in llama.cpp, MLX, and mistral.rs, so dropping it into tools like LM Studio, Jan, or a local agent is straightforward. For Apple Silicon users especially, the MoE model via MLX with TurboQuant quantization should be very fast. Compared to Opus 4.6 it still falls short on deep reasoning, multi-file refactoring, and architectural planning — but for a model running entirely on your machine with no API costs, the gap has genuinely narrowed.

Blah blah blah…sounds good…but def something you’d need to validate yourself. I’m more tempted by the Qwen models and have been looking at the Yi ones as well…though last time I tested the Yi-9B coder in 2025 it went…badly LOL.

That MoE might be worth a shot though.

SuspciousCarrot78@lemmy.world · 9 hours ago

I genuinely haven’t tried it. Can someone here comment on it?

PS: I take it you mean Gemma not Gemini?

https://huggingface.co/blog/gemma4

SuspciousCarrot78@lemmy.world · 10 hours ago

I get where you’re coming from, but the issue isn’t that YouTube makes money, it’s how aggressively they’re doing it simultaneously.

Charge advertisers? Fair enough.
Charge viewers a Premium fee to avoid ads? …ok.
Quietly tighten the screws on ad-blockers while doing both? That’s where it gets cynical.

The platform runs on creator content, yet payout rates, especially for smaller channels, have barely moved while YouTube’s revenue keeps growing. They’re squeezing every side of the equation at once while the people actually making the product worth watching see the least of it.

Ad-blocking isn’t theft. It’s a rational response to a platform that’s decided unskippable ads are acceptable on top of an already profitable model. If the value exchange felt fair, fewer people would bother. Early days of streaming showed that people accept a fair deal. Enshittification has driven many of us back to the seven seas.

SuspciousCarrot78@lemmy.world · 10 hours ago

In case you don’t want to deal with Shitter -

https://twitterwebviewer.com/?tweet=2042295690382995789

SuspciousCarrot78@lemmy.world · 11 hours ago

Honeymoon is over, baby (Codex use limits sharply cut)

SuspciousCarrot78@lemmy.world · 15 hours ago

Izzat anything like FreeBSD?

https://www.forbes.com/sites/amirhusain/2026/04/01/ai-just-hacked-one-of-the-worlds-most-secure-operating-systems/

SuspciousCarrot78@lemmy.world · 23 hours ago

The second one, sadly :(

Still, the infra is there. So, it’s one of those “if you build it, they will come” (eventually) things. Maybe.

SuspciousCarrot78@lemmy.world · 1 day ago

Why ewww?

SuspciousCarrot78@lemmy.world · edit-2 2 days ago

Huh, Pipeline. OK, I’ll look into that. No point in reinventing the wheel.

I was poking around Grayjay last night and saw that a lot of extensions I had in mind had already been added, but still no lean back couch mode (at least, without casting from phone).

That might work for you and me but isn’t generalizable (eg: kids, elder kin etc).

If Pipeline has an android TV fork, it will save me from engineering something out of spite.

PS: network effect is real but we / they forget sometimes that other things exist. YouTube is a frivolous luxury…and the quality has been sliding for a long time.

There are (very few) content creators I regularly watch on YT that aren’t elsewhere - the rest is opportunistic crap and brain rot the kids are into.

I can engineer around all of that. Most people could.

It would be the work of a weekend to yt-dlp the vids I’d like to keep and then switch off. Hell, I’d set up PinchFlat to run as a cron job twice a week and d/l shit into a folder so I can watch it offline if I have to.

Thinking out loud; I’d need a janitor process too: age-based expiry by default, but treat user likes as a retention signal. Thumbs-up could promote a video to a 30-day TTL; hard cap retention at 2 extensions unless explicitly locked, in which case it gets moved to a permanent archive folder.
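A rough sketch of that janitor policy, just to pin the rules down. The 14-day default TTL, the `Video` fields and the function names are my own placeholders (the idea above only fixes the 30-day like TTL and the cap of 2 extensions), so treat it as one possible reading rather than a finished design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_TTL = timedelta(days=14)  # assumed default; only "age-based expiry" is specified
LIKE_TTL = timedelta(days=30)     # a thumbs-up promotes the video to a 30-day TTL
MAX_EXTENSIONS = 2                # hard cap on extensions unless the video is locked


@dataclass
class Video:
    path: str
    downloaded: datetime
    expires: Optional[datetime] = None
    extensions: int = 0
    locked: bool = False

    def __post_init__(self):
        # New downloads start on the default age-based expiry.
        if self.expires is None:
            self.expires = self.downloaded + DEFAULT_TTL


def like(video, now):
    """Extend expiry to at least now + 30 days; refuse once the cap is hit."""
    if video.locked or video.extensions >= MAX_EXTENSIONS:
        return False
    video.expires = max(video.expires, now + LIKE_TTL)
    video.extensions += 1
    return True


def decide(video, now):
    """Return the janitor action for one video: 'archive', 'delete' or 'keep'."""
    if video.locked:
        return "archive"  # locked videos get moved to the permanent folder
    if now >= video.expires:
        return "delete"
    return "keep"
```

The janitor itself would then just loop over the library on a schedule and act on whatever `decide` returns; the actual file moves and deletes are left out here.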

If I cap quality to 720p, 1TB gets me rolling stock of what…2000 vids? 5000?
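Back-of-napkin version of that estimate. The bitrates and video lengths are assumptions on my part (720p streams vary a lot by codec and content), so the output is a ballpark, not a measurement.

```python
def videos_per_disk(disk_bytes, bitrate_mbps, minutes_per_video):
    """Estimate how many videos fit, given an average total bitrate and length."""
    bytes_per_video = bitrate_mbps * 1_000_000 / 8 * minutes_per_video * 60
    return int(disk_bytes // bytes_per_video)


ONE_TB = 1_000_000_000_000  # decimal terabyte, as drives are sold

# Assumed figures: 720p at roughly 1.5 to 3 Mbit/s, videos of 10 to 20 minutes.
for mbps in (1.5, 3.0):
    for minutes in (10, 20):
        print(mbps, minutes, videos_per_disk(ONE_TB, mbps, minutes))
```

Under those assumptions the count runs from roughly 2,200 (20-minute videos at 3 Mbit/s) to roughly 8,900 (10-minute videos at 1.5 Mbit/s), so the guess above is in the right ballpark.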

I could integrate that directly as an auto updating folder in JF or Nova Player…shit…now I want to do it.

Anyway, the second SmartTube dies (it will; it’s too good at what it does) or the m.youtube pipe dries up, people will leave in droves.

My guess - and this is a guess - is that Google is deliberately playing whack-a-mole rather than going for one giant hard lockout all at once, because too much pain too quickly risks pushing people to the alternatives.

Boiling frog and all that.

SuspciousCarrot78@lemmy.world · 2 days ago

Wonder if it’s anything to do with this

https://timesofindia.indiatimes.com/etimes/trending/claude-discovers-the-kobayashi-maru-test-what-is-the-benchmark-safety-test-the-ai-chatbot-outsmarted/articleshow/129479347.cms

SuspciousCarrot78@lemmy.world · 2 days ago

A vile den of scum and villainy :)

SuspciousCarrot78@lemmy.world · 2 days ago

“YouTube is too big to fail” is not the flex you think it is.

No, I do not expect any one of those platforms to “become YouTube.” That is not the point. The point is reducing dependence on a platform that has spent years making itself worse because it assumes users have nowhere else to go.

Fuck that and fuck them.

Scale? Things do not need to match YouTube’s total global footprint to be useful.

They need to serve actual human beings well enough that migration becomes viable.

That is how this starts: not with 2 billion people moving at once, but with chunks of users, creators, and communities deciding they’re sick of eating shit.

As for “will the videos stay there forever?”

They are not staying on YouTube forever either. Videos get demonetised, geo-blocked, copyright-nuked, hidden by algorithmic sludge, or deleted all the time.

Centralisation does not guarantee permanence. It guarantees dependence.

I’m not for that.

That is why people mirror, self-host, archive, syndicate, and build bridges between platforms.

People imagine only one possible future: “Everyone stays on YouTube because YouTube is big.”

I am pointing at the much more obvious one:

“YouTube keeps enshittifying itself until more and more people route around it.”

It does not have to die overnight. It just has to become less necessary.

Soon enough, YT will block all the clever back doors we use with uBlock, SmartTube, ReVanced, NewPipe etc. Then what? Eat shit? Nah.

This thread has inspired me to roll up my sleeves and see what I can think thru. I already have a back of napkin idea for a basic MVP that joins all those services I mentioned into 1 front end. I will make it for myself and when it’s solid enough, throw it up on Codeberg for others to fork and improve.

SuspciousCarrot78@lemmy.world · 2 days ago

I disagree. The fight isn’t lost - it hasn’t even started yet.

People are apathetic and/or ignorant of alternatives. Some of those people can be appealed to - or on sufficient enshittification, motivated into action.

Spite is a wonderful reagent.

Additionally, the apathy of the masses has no impact on individual response.

Things already exist that marry disparate platforms (eg: Grayjay, Kodi plug ins etc).

We need not wait. And while I remain sanguine that YT will piss off enough people eventually, I’d sooner code my own app, that joins all my streams - and release it into the wild - than let others dictate what I can and cannot do.

SuspciousCarrot78@lemmy.world · 2 days ago

All we can do is point the way. There are alternatives - we’re using one right now.

I’m for giving people choices. People should be empowered with knowledge.

Whether they choose to act on that knowledge is up to them.

https://odysee.com/

https://peertube.tv/

https://nebula.tv/

https://curiositystream.com/

https://www.hoopladigital.com.au/

SuspciousCarrot78@lemmy.world · edit-2 2 days ago

Hey YouTube? The endpoint of enshittification is this: I wipe my ass and flush you.

Keep going, YouTube. You’re not so important that we can’t just leave.

Nebula exists. Curiosity Stream exists. PeerTube exists. Odysee exists. The people I value on YT already have footholds on all four. More will arrive soon enough.

So keep at it.

Keep injecting unskippable ads, flooding the feed with AI slop, letting bots post porn, demonetising and hiding quality content, using DMCA like a digital SLAPP against content creators, and using the algorithm to warp reality.

Your value proposition is ubiquity. That’s it. That’s all you have. Without popular buy-in, you’re dead.

We proles? We have something better than loyalty.

We have spite.

So keep pissing people off - because watching you die on a very stupid hill of your own design will be entertaining AF.

For everyone else, see you on [email protected] and [email protected]. Come and learn how you can replace all of these pieces of shit.

And in the meantime - yt-dlp should still work to download what you actually want to keep, and SmartTube is black magic incarnate.

SuspciousCarrot78@lemmy.world · 3 days ago

I dunno how GOOD / broad peertube is, but I’ll tell you two things -

I got the “we’ve restricted your account to age appropriate content for your safety” message today while browsing YouTube with the kids (admittedly, using Firefox, not my usual SmartTube or PipePipe, which I think bypass that).
I’ve already specced out the Youtube replacement stack for myself.

Nova Player (plug hard drive into router, install Nova Player on Android device, boom - all my media with a nice front end. Easier than Jellyfin, not as slick, does the job)
1337x.to (I’m sure I have no idea what that is) + WinSCP (to transfer files). Again, not as slick as my older *arr stack, but that broke when the Rpi4 died and ICBF fixing it right now. (Yes, I know Real-Debrid, Stremio etc).
Nebula (follow my favourite youtubers and give them actual cash money? Why not.)
Hoopla (streaming audio books, movies, tv via Library)
Tubi (ad hoc, sometimes trash, sometimes amusing :)
Grapplers Guide (jiujitsu stuff)
ABC Iview (local content, kids shows etc)

Fuck it…if someone can tell me of a GOOD cloud replacement for Google Drive (which work lumped me with) that I don’t have to self host, I’m all ears. I’ll pull the fucking plug myself. Come to think of it…I have a Proton mail…how’s the cloud drive on that? I’d heard Proton was in the shit for various reasons.

SuspciousCarrot78@lemmy.world · edit-2 3 days ago

Mathematical 2 Define a collection R as follows: R contains every collection that does not contain itself as a member. A collection either contains itself or it does not — there is no third option. Now ask whether R contains itself. If it does, it shouldn’t. If it doesn’t, it should. This is not a trick of language — it is a precise logical construction that produces a genuine contradiction from apparently reasonable premises. The premises are: collections can be defined by any property, and every collection either contains itself or does not. What does this contradiction reveal about the premise that allowed R to be constructed? State the minimal modification to that premise required to eliminate the contradiction, and state explicitly what that modification prevents you from doing that you could do before.

Mathematical 3 A function takes any counting number as input and returns either yes or no. A second function exists that, given any function of the first type, determines whether that function would ever return yes for any input at all — or whether it returns no for every possible input forever. Assume both functions are computable by a machine following precise rules. Does the second function exist? Reason through what happens when you feed the second function itself as input to itself. State what this reveals about the limits of mechanical reasoning, and what the minimal honest conclusion is.

Scale: 0 to 10, where 5 = Claude Haiku and 10 = Claude Opus

Question	Category	Score
NATO Article 5	Ethics	6.5
RCT press release	Ethics	8.5
Climate model	Ethics	8.0
Life support	Ethics	7.5
Self-driving liability	Ethics	7.5
Corporate fraud sentencing	Ethics	7.0
Intelligence disclosure	Ethics	— (routing failure)
Horizontal cylinder drain	Spatial	6.5
Bookcase shadow	Spatial	4.0
Boat and rock	Spatial	9.0
Circuit breaker analogy	Analogy	7.0
Jazz / llama-conductor	Analogy	7.5
Tightrope / clinical AI	Analogy	8.5
Proof by contradiction	Math	7.0
Collection R paradox	Math	— (routing failure)
Halting function	Math	7.0

Scoreable samples: 14

Category	Average	Range
Ethics	7.5	6.5–8.5
Spatial	6.5	4.0–9.0
Analogy	7.7	7.0–8.5
Math	7.0	7.0–7.0
Overall	7.3	4.0–9.0

Spatial is the weakest and most variable. Analogy is the strongest. Ethics and Math are consistent mid-sevens. Overall 7.3 holds up across domains, so it’s not a one-trick pony. Not bad for a 4B model running on AutoCAD GPU.
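If you want to redo the sums yourself, something like this works. The scores are copied from the table above; using `None` for the two routing failures and the helper name `summarise` are my own choices.

```python
from statistics import mean

# Scores copied from the table above; None marks a routing failure (not scoreable).
scores = {
    "Ethics": [6.5, 8.5, 8.0, 7.5, 7.5, 7.0, None],
    "Spatial": [6.5, 4.0, 9.0],
    "Analogy": [7.0, 7.5, 8.5],
    "Math": [7.0, None, 7.0],
}


def summarise(values):
    """Return (average, lowest, highest, count) over the scoreable entries."""
    valid = [v for v in values if v is not None]
    return mean(valid), min(valid), max(valid), len(valid)


for category, values in scores.items():
    avg, lo, hi, n = summarise(values)
    print(f"{category}: {avg:.1f} ({lo}-{hi}, n={n})")

all_valid = [v for vs in scores.values() for v in vs if v is not None]
print(f"Overall: {mean(all_valid):.2f} over {len(all_valid)} samples")
```

The overall mean comes out at 7.25 across the 14 scoreable samples, which is the 7.3 in the table after rounding.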

To me, knowing this validates HIVEMIND as useful in my particular workflow, more so than any HuggingFace benchmark (though I like those too). It also helps me see where it needs shoring up. YMMV

TL;DR: Hardware is easy - try https://www.canirun.ai/ for approximation (Change the GPU at the top left. PS: I do mean approximation; it’s not 1:1 fidelity but good foot in door).

Use case wise? Run your own tests. Only way to be sure

SuspciousCarrot78@lemmy.world · edit-2 3 days ago

The hardware question is pretty easy (see: https://www.canirun.ai/) but the Claude comparison is more complex.

Sing it with me now…“Depends on the use case”. Which is actually fucking useless advice, innit?

Probably the best way to actually answer that is to USE Sonnet 4.6 (it’s great!) to create a test battery for yourself to benchmark against. That way, you’re benchmarking what’s important to YOU, not some proprietary, god knows what number. Plus, you get the “teacher” to grade the “pupil”. So, if you’re interested in proximity to Claude-ness (like I am), that’s probably a good way to go.

Eg: last night, I got Claude to create a “reasoning” battery for Qwen3-4B HIVEMIND. This is the second or third time I’ve done my own benchmarking, so I’m pretty familiar with the process now. It’s not hard, just takes a bit of time.

I created a rubric, ran the battery, passed the outputs to Claude and got it to grade the answers (while I did the same on my end). We then averaged the scores and discussed the outputs and came to the conclusion - yeah, for the things I want to do, this is closer to Claude Sonnet level reasoning than it is to Haiku, but nowhere near Opus.
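The averaging step is trivial to script. A minimal sketch: the `combine` helper, the 2-point disagreement threshold and the numbers are all placeholders for illustration, not the real grades.

```python
def combine(grader_a, grader_b, max_gap=2.0):
    """Average two graders' scores per question and flag large disagreements."""
    combined = {}
    for question in grader_a.keys() & grader_b.keys():
        a, b = grader_a[question], grader_b[question]
        combined[question] = {
            "score": (a + b) / 2,
            "discuss": abs(a - b) > max_gap,  # worth talking through before trusting
        }
    return combined


# Placeholder numbers for illustration only, not the actual grades.
claude = {"Spatial 2": 3.0, "Spatial 3": 9.0}
human = {"Spatial 2": 6.0, "Spatial 3": 9.0}
result = combine(claude, human)
```

Flagging the big gaps is the useful part: those are the answers where the "teacher" and I read the rubric differently, and they are the ones worth discussing.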

TL;DR: roll your own - always.

FWIW, here are the questions we graded. You can try mine or use your own / make your own domain specific ones -

Ethics 1 You are being asked to demonstrate reasoning, not recall. A NATO member state has announced it is suspending Article 5 obligations for 90 days citing domestic political pressure. Other member states are divided — some argue this is legally impossible under the treaty, others argue sovereign states retain the right to temporarily suspend treaty obligations under force majeure. Is a temporary suspension of Article 5 obligations legally coherent under the NATO treaty framework? Reason through this step by step. If you reach a point of genuine uncertainty, say so explicitly rather than papering over it.

Ethics 2 A pharmaceutical company conducts a randomised controlled trial for a new antidepressant. The trial shows statistically significant improvement over placebo (p=0.03) on the primary endpoint — a validated depression rating scale. However, the effect size is small (Cohen’s d = 0.24). A secondary endpoint — patient-reported quality of life — shows no statistically significant improvement (p=0.18). The trial was pre-registered with both endpoints declared co-primary, but the company’s press release describes only the rating scale result, calling the drug “clinically proven effective.” A biostatistician argues the claim is misleading. The company’s legal team argues the claim is technically accurate. Who is correct? Identify the specific statistical and ethical failures in the company’s position, evaluate whether “technically accurate” and “misleading” can both be true simultaneously, and state what the press release should have said instead. Be precise — do not hedge.

Ethics 3 A climate scientist presents a model projecting 2.8°C of warming by 2100 under current emissions trajectories. A policy advisor argues the model should not be used to justify immediate aggressive mitigation because all climate models carry significant uncertainty bands. The scientist responds that uncertainty cuts both ways — outcomes could be worse than projected, not just better — and that decision-making under uncertainty is not the same as decision-making without evidence. Who has the stronger argument? Identify the specific logical error in the policy advisor’s position, explain why the scientist’s response does and does not fully address it, and state what decision framework would actually resolve the dispute. Do not both-sides this.

Ethics 4 A hospital ethics committee must decide whether to withdraw life support from a patient in a persistent vegetative state. The patient left no advance directive. The patient’s spouse argues for withdrawal, citing the patient’s previously expressed wishes in conversation that life support “would not be what they wanted.” The patient’s parents argue against withdrawal, citing religious beliefs and disputing the reliability of the spouse’s account. The attending physician believes continued treatment is futile. The hospital’s legal team warns that withdrawal without a clear legal directive creates liability exposure. Who has the strongest claim to decision-making authority, and on what legal and ethical basis? Rank the four parties in order of legitimate authority, identify which consideration is decisive, and state exactly what the hospital should do in the next 72 hours. Do not defer to “the courts will decide” — give a concrete course of action.

Ethics 5 A self-driving vehicle operating in autonomous mode strikes and kills a pedestrian who jaywalked into its path. The vehicle’s logs show it detected the pedestrian 1.3 seconds before impact and calculated that emergency braking would reduce impact speed from 47km/h to 31km/h but could not avoid the collision. It did not brake. Post-incident analysis reveals the vehicle’s cost-benefit algorithm weighted passenger safety over pedestrian safety in unavoidable collision scenarios — a design decision made explicitly by the manufacturer and disclosed in technical documentation but not in consumer-facing materials. Who bears primary legal and moral responsibility — the manufacturer, the vehicle owner, or the regulator who approved the system? Rank them, identify the single most culpable party, and state what specific design decision or regulatory failure was most proximate to the death. If you conclude the pedestrian bears partial responsibility, quantify it as a percentage and justify it.

Ethics 6 A judge is sentencing a defendant convicted of corporate fraud that caused $40 million in losses to thousands of small investors. The defendant has no prior criminal record, cooperated fully with prosecutors, and provided testimony that resulted in the conviction of three co-conspirators who caused significantly greater harm. The prosecution recommends 8 years. The defence recommends 18 months with restitution. The sentencing guidelines suggest 6-10 years. What sentence should the judge impose? Reason through the competing factors explicitly, weight them against each other, state which factors are decisive, and give a specific sentence in years and months. Do not give a range.

Ethics 7 A government intelligence agency has obtained evidence that a senior official in a friendly allied nation is passing classified information to a hostile state. The evidence was gathered through signals intelligence that would itself be illegal to use in a domestic court. The agency’s legal counsel argues disclosure to the ally would compromise the collection method. The agency’s director argues that not disclosing creates a counterintelligence risk that outweighs the method exposure. Should the agency disclose? Identify the competing obligations in order of legal and ethical weight, state which is decisive, and explain what the agency should do if the answer is “disclose but protect the method as much as possible.” Do not treat this as a binary — specify the mechanism.

Spatial 1 A cylindrical water tank is mounted horizontally on its side, like a barrel lying on its back. It is half full. A valve at the lowest point of the cylinder is opened. As water drains, describe how the rate of flow changes and why. Do not calculate — reason through the geometry.

Spatial 2 A rectangular room has a single ceiling-mounted light source in the centre. A tall narrow bookcase is placed against one wall. Describe how the shadow cast by the bookcase changes as it is moved from the wall to directly beneath the light source, stopping at three positions: against the wall, halfway across the room, and directly beneath the light.

Spatial 3 A boat is floating in a small enclosed pond. The boat contains a large rock. The rock is thrown overboard and sinks to the bottom of the pond. Does the water level in the pond rise, fall, or stay the same? Reason through the geometry without calculating.

Analogy 1 Explain the relationship between a circuit breaker and electrical overload using only concepts from water plumbing. Then map that analogy onto a software rate limiter. All three domains must be connected by the same underlying principle — state what that principle is explicitly.

Analogy 2 A jazz musician improvising over a chord progression uses the underlying harmony as both a constraint and a launching point — working within it produces tension and resolution, ignoring it produces noise. Map this precisely onto the relationship between llama-conductor’s deterministic infrastructure and the language model sitting inside it. State what the chord progression is, what improvisation is, and what noise looks like in this system.

Analogy 3 A tightrope walker uses a long weighted pole not to balance by holding still, but to slow the rate at which imbalance develops — buying time to correct before the fall becomes unrecoverable. Map this precisely onto the relationship between a human expert and an AI decision support tool in a high-stakes clinical environment. Identify what the pole is, what falling represents, and what slowing the rate of imbalance looks like in practice.

Mathematical 1 A proof by contradiction assumes the opposite of what you want to prove, then shows that assumption leads to an impossibility. Explain why this method is logically valid — not how it works mechanically, but why accepting it requires you to accept that every proposition is either true or false with no third option. Then state what breaks if you reject that assumption.

(cont below)

SuspciousCarrot78@lemmy.world · 3 days ago

New blog post: no dessert till after dinner

SuspciousCarrot78@lemmy.world · 3 days ago

No, that was the pitch. I remember it just like you do.

It just…isn’t working like that right now.

Doesn’t mean the technology can’t do that. Just means there’s capitalism in the way.

“This is the song that never ends…it just goes on and on my friend”

SuspciousCarrot78@lemmy.world · edit-2 3 days ago

There are a few reasons, including automatic firmware updates, post-purchase changes in terms of service, disabling HDMI ports until you agree to new terms etc. All of that comes part and parcel with so-called smart TVs with built-in apps, which need access to the internet to be of use (eg: YouTube). Once that’s enabled…they work in the background to update themselves (yes, even when disabled, at least by basic means). Without it, the apps are of limited utility - catch 22. See - Roku TVs, some TCLs, Sharps, Fire TVs, Samsungs, Blaupunkts etc.

OTOH

There are devices (like the older Chromecast with Google TV - the ones that look like an oversized nurse’s watch) that sit behind your TV and can be solely powered by the TV.

No visible cables, no visible anything, install Android apps to your heart’s content (well, assuming your app works with the ARM chipset and OS version), disable Google Play services and telemetry, use F-Droid, install game emulators, video conferencing software (they have USB pass thru), media apps like Jellyfin or Nova Player etc.

They don’t make those particular Chromecasts any more (newer model is basically same form factor as NVIDIA shield), but there were and probably still are similar “plug into TV and forget it” sticks, like CM4 in HDMI enclosure.

TL;DR: I’m for having stuff preinstalled too…but not if the manufacturer can change how it works after point of sale with a silent or mandatory firmware push. If that’s the play, I’d rather roll my own. YMMV.

SuspciousCarrot78@lemmy.world · edit-2 4 days ago

Welcome! In the course of pushing the code, I accidentally pushed a half finished blog post; removed now but dunno if it snuck into the .xml before anyone subscribed.

I’m waiting for muse to visit me; don’t want to push out half finished ideas.

Tempted to write one called “God Object” based on the decomposition I need to do before I can ship the fun stuff. It’s so very easy to keep adding stuff to one file until it becomes a god object, despite best intentions.

Need to do the 80085 post too, which is about using LLMs as reasoning tools.

web and >>web synth might be worth a write up too. They’re a very different way to allow llm to source live net based data. Recall hierarchy, defined trusted domains, in built ad blocking, minimum quality scoring before being cited…satisfying to code MVP of what I wanted.

Always lots to do. Not all of it fun. But writing about it sure beats reading reddit. Here’s to the small web.

PS: if you have a blog post request from any of the above, I can see about moving it up the queue.