The Generative AI Bubble Is Really Going to Pop - Part Deux

dzid

Ars Centurion
2,653
Subscriptor
I ended up going deeper and longer than I expected, but if you made it this far I would welcome any thoughts, corrections, or different perspectives!
That's interesting. I'd like to know whether you think that fits at all into the land grab I wrote about in posts 54-57 and 60. This is a wild guess, but I'd think the inference losses are getting obscured. I may be able to provide deeper information if someone expresses serious interest.
 

hanser

Ars Legatus Legionis
42,522
Subscriptor++
My supposition is that the other Services are subsidizing LLM activities for now, but Gemini 3 just dropped the other day, and Opus 4.5 dropped a day or two after that. The marginal cost of an Opus 4.5 token is half what it was for Sonnet 4.5 (which came out in Sept-ish?). So we're seeing token costs halve a couple of times a year now.

For Google, I suspect Gemini's per-token costs are also dropping pretty steeply with hardware + software innovation. So the ratio in late 2024 is almost certainly completely different from late 2025. The embedded interface toolsets are also getting better at switching between models based on the nature of the work being done. (Claude Code, for example, will pair Opus as the big brain with sub-agents using Haiku, which is a very, very cheap model, and it's not worse off for doing so.)

If a company can keep per-customer revenue stable, and not fall too far afoul of Jevons paradox, they could start to see the financial picture improve. In my case, I was never short of tokens at $100/mo, and I'm not really using more tokens now than I was 3 months ago(*), so Anthropic has just improved its finances with respect to people like me who are heavy users.
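To make the shape of that concrete, here's a back-of-the-envelope sketch; every number in it is a made-up placeholder, not Anthropic's actual pricing, costs, or my actual usage:

```python
# Hypothetical illustration of the "flat revenue, halving token cost" effect.
# None of these numbers are real; they only show the shape of the argument.

monthly_price = 100.00          # $/month subscription, assumed constant
tokens_per_month = 50_000_000   # heavy user's token usage, assumed constant

def margin(cost_per_million_tokens: float) -> float:
    """Gross margin on one subscriber at a given serving cost ($/1M tokens)."""
    serving_cost = tokens_per_month / 1_000_000 * cost_per_million_tokens
    return (monthly_price - serving_cost) / monthly_price

# Serving cost halves twice a year: e.g. $1.60 -> $0.80 -> $0.40 per 1M tokens.
for cost in (1.60, 0.80, 0.40):
    print(f"${cost:.2f}/1M tokens -> margin {margin(cost):.0%}")
# $1.60/1M tokens -> margin 20%
# $0.80/1M tokens -> margin 60%
# $0.40/1M tokens -> margin 80%
```

Same subscriber, same usage: if serving costs really do keep halving, the margin swing is large even with zero revenue growth.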

Some more on token costs:
https://arstechnica.com/civis/threads/adventures-coding-with-ai.1501554/post-44106198

(*) But my experience with tools like CC isn't meaningfully changed from 3 months ago, either. They're just as capable and boneheaded as they've always been, modulo edges here and there. IOW, these models aren't really changing my life in any appreciable way.
 
Last edited:

ramases

Ars Tribunus Angusticlavius
8,436
Subscriptor++
I am going to go with the 'ad/non-AI cloud services revenue money printer goes brrrrr' theory of profitability.

Also, I question how questionable extending the server/networking depreciation lifespan really is. Modern DC hardware is easily reliable enough for a 6-year lifespan, and for general-purpose compute there were no real game-changers in the last 4 years that would necessarily warrant the cost of replacing now rather than in 4 years.

Especially in a cattle-not-pets setup where failure is an expected occurrence as long as its frequency does not rise above forecast.

I see no fundamental reason why a 2023 business decision of "the hardware we bought in 2019 has at least two more economically useful years of life left in it, so we're going to keep using it" would be out of the ordinary; and if the business decision is fine, so likely is the accounting decision.

I suspect there's also a limit to how large an infra deployment/replacement/renewal capacity (which isn't free, and all the categories usually compete for the same resources) they want to retain internally; also, given how quickly they are planning to ramp up infra investment, there's probably something to be said for reducing secondary workloads. If you go from a 4-year to a 6-year refresh period for general-purpose compute, you've just cut your annual infra replacement workload for that segment by about a third.
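Rough numbers for why the schedule change alone is worth a few billion dollars a year (the fleet cost below is a made-up round figure, not Google's actual spend):

```python
# Straight-line depreciation: annual expense = purchase cost / useful life.
# The fleet cost below is a made-up round number for illustration only.

fleet_cost_billion = 48.0   # hypothetical cumulative server/network spend

for life_years in (4, 6):
    annual_expense = fleet_cost_billion / life_years
    print(f"{life_years}-year life: ~${annual_expense:.0f}B/year of depreciation")
# 4-year life: ~$12B/year of depreciation
# 6-year life: ~$8B/year of depreciation
# Difference: ~$4B/year less expense, purely from the schedule change.

# The refresh workload scales the same way: replacing 1/6 of the fleet per
# year instead of 1/4 is roughly a one-third reduction in replacement volume.
```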
 
  • Like
Reactions: flere-imsaho

dzid

Ars Centurion
2,653
Subscriptor
Glad to find this thread. I just made a semi-deep dive into Google's AI-related financials after reading the "Google must double infrastructure every six months" thread and thought this might be of interest here. Copying it over verbatim below. Link to my post: https://arstechnica.com/civis/threa...onths-to-meet-ai-demand.1510433/post-44107973



I'm fascinated by all this, so I went deep-diving into Alphabet's (Google's) financials to try to figure out how much they actually disclose about their AI costs and revenues. They definitely obfuscate a lot of it, but I found a few interesting nuggets:

  1. Google places the cost of developing the Gemini application (the front-end) as a cost in Google Services, where Google Search lives. It's a highly profitable segment, and the cost of developing a single application (even an important one) is very modest by comparison, so this cost is completely buried and can't be estimated.
  2. The Google DeepMind team (which now includes all AI team members including all of Google Brain) now lives in the relatively new "Alphabet-level activities" (yes, that's exactly what it's called).
  3. Alphabet-level activities also includes AI R&D and new model training. (The justification, I suppose, is that Google is embedding AI in all of its products across the board so it's defensible to hold the expense at the "corporate parent" location.)
A good number of internet services have, up to now, run without the assistance of LLMs. For the purposes of what they do, users have no need for chatbots or LLMs. Why incorporate LLMs into products like that? To be clear, my concern is Google and others forcing LLMs on users and providing no other option.

Here's something on Google Brain (which up to now I knew zero about) that I found:
Google Brain, as the name suggests, is meant to replicate, as closely as possible, the functioning of a normal human brain. And the team behind it has been largely successful in doing so. In October 2016, the people behind the Brain conducted a basic simulation of communication between three AIs: Alice, Bob and Eve. The purpose was to have Alice and Bob communicate effectively – without Bob misreading Alice's messages and without Eve intercepting them – with Alice and Bob carrying out proper encryption and decryption on their respective ends. The study showed that for every round where they failed to communicate properly, the next round showed a significant improvement in the cryptographic abilities of the two AIs.

Even though a normal person might think that cryptography as such is largely absent from normal human communication, nothing could be further from the truth. We communicate not only through words but also gestures – waves, eye rolls, and sighs.
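For anyone curious what that experiment looks like mechanically, here's a heavily simplified toy sketch of the adversarial Alice/Bob/Eve setup described above, reconstructed from the description rather than from Google's actual code; the bit width, network sizes, losses, and training schedule are all arbitrary choices:

```python
# Toy sketch of adversarial neural cryptography (Alice/Bob/Eve), written from
# the description above; architectures and hyperparameters are arbitrary.
import torch
import torch.nn as nn

N = 16  # bits per plaintext and per shared key

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 2 * in_dim), nn.Tanh(),
                         nn.Linear(2 * in_dim, out_dim), nn.Tanh())

alice = mlp(2 * N, N)  # sees plaintext + key, emits a "ciphertext"
bob   = mlp(2 * N, N)  # sees ciphertext + key, reconstructs the plaintext
eve   = mlp(N, N)      # sees only the ciphertext

opt_ab = torch.optim.Adam(list(alice.parameters()) + list(bob.parameters()), lr=1e-3)
opt_e  = torch.optim.Adam(eve.parameters(), lr=1e-3)
l1 = nn.L1Loss()

def batch(size=256):
    # Random plaintexts and keys encoded as +/-1 bits.
    return (torch.randint(0, 2, (size, N)).float() * 2 - 1,
            torch.randint(0, 2, (size, N)).float() * 2 - 1)

for step in range(5000):
    # 1) Train Alice and Bob: Bob should recover the plaintext, Eve should not.
    p, k = batch()
    c = alice(torch.cat([p, k], dim=1))
    bob_err = l1(bob(torch.cat([c, k], dim=1)), p)
    eve_err = l1(eve(c), p)
    # Eve guessing at chance has an error of ~1.0 on +/-1 bits, so this term
    # penalizes Alice/Bob whenever Eve does better than chance.
    ab_loss = bob_err + (1.0 - eve_err) ** 2
    opt_ab.zero_grad(); ab_loss.backward(); opt_ab.step()

    # 2) Train Eve alone to decrypt without the key.
    p, k = batch()
    c = alice(torch.cat([p, k], dim=1)).detach()
    e_loss = l1(eve(c), p)
    opt_e.zero_grad(); e_loss.backward(); opt_e.step()

    if step % 1000 == 0:
        print(f"step {step}: Bob error {bob_err.item():.3f}, Eve error {eve_err.item():.3f}")
```

The adversarial dynamic is the whole point: each time Eve gets better at eavesdropping, Alice and Bob are pushed to encode their messages more robustly.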

So if Google Brain people are developing the models and associated technology to embed in Google's products, I'd find that unsettling on its own from a human-psychology standpoint, never mind the concerns that arise from the ability to "nudge" people, in a good way or especially a bad way, and from the increasing dominance of the AI landscape by corporate actors with no incentive other than profit. (See the link in #161 for more info.)
 

w00key

Ars Tribunus Angusticlavius
7,975
Subscriptor
In 2023 Google extended the depreciation lifespan of servers and network equipment to six years (it was four and five years, respectively), a questionable move. This has the effect of reducing yearly costs by about $4B.
If anything, the previous timeline was too aggressive. I myself still run workloads on E2 instances. These are on Haswell, Broadwell, Skylake, Rome, and Milan hardware platforms, launched 2014-2021. 6 years is more than fair.

I still have random strays on N1, 2012-2017 hardware.


For GPU, the ones being removed from production now are Pascal (P100, P4, 2016), Volta (V100, 2017), Turing (T4, 2018), but you can still rent them. Ampere (A100, 2010) are widely available and affordable workhorses for inference.

On GCP, P4, T4, and newer are widely available and in production, so when you can still sell capacity on 10-year-old hardware, a 6-year depreciation cycle is fine, maybe even too aggressive; you don't want hardware to be at zero on the books while it's still useful.
 
  • Like
Reactions: Pino90

Exordium01

Ars Praefectus
4,195
Subscriptor
Keep in mind with all of this that R&D spending doesn't necessarily hit COGS, so it won't pull down gross margin.
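A tiny worked example of that accounting point, with hypothetical round numbers rather than any company's actual figures: R&D sits in operating expenses, so it dents operating margin but leaves gross margin untouched.

```python
# Hypothetical income-statement slice showing where R&D lands.
revenue = 100.0
cogs = 40.0          # cost of revenue: serving infrastructure, depreciation, etc.
rnd = 25.0           # research & development (incl. model training, if booked there)

gross_margin = (revenue - cogs) / revenue
operating_margin = (revenue - cogs - rnd) / revenue

print(f"gross margin:     {gross_margin:.0%}")      # 60% - unaffected by R&D
print(f"operating margin: {operating_margin:.0%}")  # 35% - where R&D shows up
```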

Oracle does look pretty much cooked, with negative free cash flow projected out to 2030 and obscene increases in revenue needed to balance out the spending. The pricing of credit default swaps on their debt suggests their ability to borrow more is already impaired.

Google and Microsoft may be burning the furniture to heat the mansions, but their spending problems aren’t yet an existential risk to their businesses.
 

MilleniX

Ars Tribunus Angusticlavius
7,672
Subscriptor++
Ampere (A100, 2010) are widely available and affordable workhorses for inference
(That should be 2020)

They're also still the best GPU with wide cloud availability for higher-precision numerical workloads. Everything since then from Nvidia has traded off lower throughput of 64 and 32 bit floating point operations for higher throughput of progressively less precise numbers. You can do useful inference at 8 or even 4 bits per weight. You can't do useful math like that, though.
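To put rough numbers on the precision point, here's a simplistic uniform-quantization sketch (not how any particular inference stack actually quantizes): the rounding error you accept at 8 or 4 bits is tolerable as noise on weights, but it's orders of magnitude away from what numerical work needs.

```python
# Crude uniform quantization to show the precision gap; real inference stacks
# use smarter schemes (per-channel scales, etc.), but the order of magnitude holds.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

def quantize(values, bits):
    """Round values onto a symmetric uniform grid with 2**bits levels."""
    scale = np.abs(values).max() / (2 ** (bits - 1) - 1)
    return np.round(values / scale) * scale

for bits in (8, 4):
    err = np.abs(quantize(x, bits) - x).mean()
    print(f"{bits}-bit grid: mean abs error ~ {err:.3g}")

print(f"float32 rounding error ~ {np.abs(x.astype(np.float32) - x).mean():.3g}")
# float64, by contrast, carries ~15-16 significant decimal digits.
```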

If/when there's an actual decrease in demand for AI usage of cloud-available GPUs, those will resume being fully booked for deterministic use cases.
 

w00key

Ars Tribunus Angusticlavius
7,975
Subscriptor
(That should be 2020)

They're also still the best GPU with wide cloud availability for higher-precision numerical workloads. Everything since then from Nvidia has traded off lower throughput of 64 and 32 bit floating point operations for higher throughput of progressively less precise numbers. You can do useful inference at 8 or even 4 bits per weight. You can't do useful math like that, though.

If/when there's an actual decrease in demand for AI usage of cloud-available GPUs, those will resume being fully booked for deterministic use cases.
Ah yeah, typo in my head. I usually read the forums during downtime on my phone, which makes writing longer replies much harder than on a laptop.

A100 is indeed a fine working card and will be for a very long time. "Old school" workload doesn't need to chase the highest HBM size and speeds, nor do we need 8+ cards in the same logical system.
 

MilleniX

Ars Tribunus Angusticlavius
7,672
Subscriptor++
"Old school" workload doesn't need to chase the highest HBM size and speeds, nor do we need 8+ cards in the same logical system.
All of the other advances in memory performance, capacity, and closer packaging are absolutely useful at the higher end of those workloads. I'm saying that with a background of having run parallel jobs that spanned thousands of GPUs, though.
 

w00key

Ars Tribunus Angusticlavius
7,975
Subscriptor
Expanding on AI closing the knowledge gap.


I am now in Osaka, and my local knowledge is near zero. I can Google (actually Kagi), but my hit rate with English queries will be near zero. Like, what's a good tea shop? Uuh...

An LLM, because it is fluent in Japanese, knows what the local favorites are, when they were founded, and what they specialize in (matcha vs. sencha vs. hojicha), and can present that back in English. Seafood options? Crab, sushi, pufferfish, izakaya with good seafood options; it knows them all.

What also works: in Tokyo, Kaitenzushi Nemuro Hanamaru Ginza (回転寿司 根室花まる 銀座店) was amazing; what is similar here? => Hokkaido style, fish freshly flown in, focus on quality, upmarket kaitenzushi. Oh, these are the options here...

I can find them manually; I did so with many hours of Tabelog x Maps cross-checking, but not everyone can. And for non-restaurant queries, like finding furikake and tea options, a big LLM is crazy amazing if you know how to query it. This will become a core skill; it works for leisure and for work.


You have to prompt well though; a few times I pasted in too many Japanese characters and it went "一保堂茶舗は大阪高島屋の地下1階(B1F)にあります。" ("Ippodo Tea is on basement floor 1 (B1F) of Takashimaya Osaka") on me, lol. Yeah, no, I can't read it that well. Oh actually, okay, this sentence I can. Huh. But not the following wall of text. Gemini Pro is wordy but very good for discovery and research. Gemini has learned new tricks, too: before, it could show results only as text; now it can also show them on an embedded map, which is very handy.


For anything cross-border or cross-language, it is amazing. I knew that already from my Cantonese cuisine queries and from using it as a translation helper, and it works across programming languages too: know Java, Python, TypeScript, want to learn Rust? ELI5.


But, and this is a big one: use the Pro model. Flash is pretty trash. Even Pro makes mistakes, like showing a store that has closed, but it digs deeper using Google and Maps search and has very high accuracy.
 
Last edited:
  • Like
Reactions: Pino90