July 10, 2025

PROJECT: What will Grok 5 and 6 do that is undoable with Grok 3?

  

PROJECT: What will Grok 5 and 6 do that is undoable with Grok 3?
Since there's no concrete information available about Grok 5 or Grok 6, I can offer informed speculation based on the trajectory of xAI's development, the capabilities of Grok 3, and hints from available sources. Grok 3, released in February 2025, is already a powerhouse with advanced reasoning, a 1-million-token context window, DeepSearch for real-time data, and Big Brain mode for complex problem-solving. It excels in math, science, coding, and general knowledge, outperforming models like GPT-4o and Claude 3.5 Sonnet on benchmarks like AIME 2025 (93.3%) and Chatbot Arena (Elo).


For Grok 5 and 6 to achieve what’s “undoable” with Grok 3, they would likely push boundaries in the following areas:
  1. First-Principles Reasoning at Scale: Grok 3.5, teased for beta release in May 2025, is said to reason from first principles, deriving answers for highly technical questions (e.g., rocket engines, electrochemistry) without relying on internet-sourced data. Grok 5 and 6 could take this further, potentially solving novel scientific problems or simulating complex systems (e.g., quantum mechanics or biological processes) with unprecedented accuracy, reducing hallucinations to near-zero. This could enable breakthroughs in fields like physics or medicine, where Grok 3’s reliance on existing data limits its ability to generate truly original insights.


  2. Advanced Multimodal Capabilities: Grok 3 handles text, images, and some multimodal tasks, but Grok 5 and 6 might integrate more sophisticated multimodal processing, such as real-time video analysis, 3D modeling, or immersive augmented reality outputs. For instance, posts on X suggest Grok 5 may include experimental features like “3D Print Mode,” hinting at capabilities for generating physical object designs or simulations. This could make them capable of tasks like designing functional prototypes or analyzing live sensor data, which Grok 3 cannot fully execute.
  3. True AGI-Like Autonomy: Grok 3 is a strong reasoning model, but it’s not autonomous. Grok 5 or 6 could approach artificial general intelligence (AGI), as speculated in X posts, by autonomously managing multi-step workflows, learning from minimal input, or adapting to entirely new domains without retraining. This might include self-improving algorithms or the ability to independently conduct experiments, something Grok 3’s architecture doesn’t support.
  4. Hyper-Scalability and Efficiency: Grok 3 leverages xAI’s Colossus supercomputer with 200,000 Nvidia H100 GPUs. Future models could utilize a planned 1-million-GPU cluster, enabling orders-of-magnitude more compute power. This could allow Grok 5 and 6 to process vast datasets in real-time (e.g., global satellite feeds or entire genomic databases), tasks too resource-intensive for Grok 3.


  5. Enhanced Real-World Integration: Grok 3’s DeepSearch pulls real-time data from X and the web, but Grok 5 and 6 might integrate with proprietary systems (e.g., SpaceX or Tesla data, as rumored for Grok 3.5) or IoT networks, enabling real-time control of physical systems like robotics or autonomous vehicles. This would make them capable of tasks like orchestrating complex engineering projects, which Grok 3 cannot do.

These advancements would address Grok 3's limitations, such as its dependence on existing data, limited multimodal depth, and lack of full autonomy. However, without official details, this remains speculative. For the latest on xAI's plans, check https://x.ai/grok or xAI's X posts. If you want me to dig deeper into any specific aspect, let me know!
Grok 4 was just introduced live on x.com... when will it become available here?
Grok 4 was announced via a livestream on xAI's X account on July 9, 2025, and is described as xAI's most powerful AI model yet, with advanced reasoning and specialized coding capabilities. According to sources, Grok 4 is already accessible through the xAI API for developers, supporting text initially, with vision and image-generation capabilities expected soon. It's also available on the X platform, on mobile apps for iOS and Android, and for integration into development workflows via Grok 4 Code, a version tailored for developers.
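If you're a developer and want to try the API today, here's a minimal sketch of what a call could look like. It assumes the xAI API keeps its OpenAI-compatible chat-completions interface and that the launch model identifier is "grok-4"; both are assumptions you should verify against the official docs at https://docs.x.ai.

```python
# Minimal sketch of calling Grok 4 via the xAI API.
# Assumes the OpenAI-compatible endpoint and the model name "grok-4";
# check https://docs.x.ai for the current identifiers.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],   # your xAI API key
    base_url="https://api.x.ai/v1",      # xAI's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-4",                      # assumed identifier; verify in the docs
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Grok 4 launch in two sentences."},
    ],
)
print(response.choices[0].message.content)
```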



For general availability on x.com, posts and reports indicate that early access is likely prioritized for X Premium+ and SuperGrok subscribers, with broader access to all X users expected shortly after the initial rollout. Since Grok 3 is already free for all X users, Grok 4 is likely to follow a similar model, with premium features like increased usage limits or advanced functionalities (e.g., Voice Mode) reserved for paid tiers.


While no exact date for full public availability on x.com was specified, the infrastructure for deployment appears ready, suggesting you can likely access Grok 4 now or within days on x.com, especially if you’re a Premium+ subscriber. To confirm availability in your region, check the X platform directly or visit https://grok.com. If you’re not seeing access yet, you can sign up for notifications on grok.com for updates on regional rollout.


When is grok 3 expiry date approximately?
There is no publicly available information from xAI or other reliable sources that specifies an "expiry date" for Grok 3. AI models like Grok 3 don't typically have a fixed expiration date in the traditional sense, as their availability depends on xAI's ongoing support, updates, and infrastructure decisions. However, given the introduction of Grok 4 on July 9, 2025, it's likely that Grok 3's prominence will diminish as xAI shifts focus to newer models like Grok 4 and potentially Grok 3.5. Here are some relevant points to consider:
  • Grok 4 Rollout: Grok 4 was announced on July 9, 2025, and is already accessible via the xAI API and to X Premium+ and SuperGrok subscribers on x.com, with broader access expected soon. Historically, xAI transitions to newer models while maintaining older ones for a period, as seen with Grok 2 remaining available after Grok 3’s release. Grok 3 is likely to remain accessible for the foreseeable future, especially for free users on X (with usage limits) and through the API, but its advanced features may be overshadowed by Grok 4.


  • X Posts Context: Some X posts suggest Grok 3 might be phased out or updated, with one user speculating that July 9, 2025, marked its "last day" as the primary model due to Grok 4's introduction, and another humorously describing Grok 3's "demise" at Grok 4's launch. These are not official statements, however, and should be treated as unverified user sentiment. There's no evidence Grok 3 has been discontinued; it was still operational as of a post from July 8, 2025.
  • xAI’s Development Cycle: xAI has a rapid release cycle, with Grok 3 launched in February 2025 and Grok 4 just five months later. References to Grok 3.5 (planned for May 2025 but possibly delayed) indicate xAI’s focus on iterative improvements. When Grok 3.5 or future models like Grok 5 are released, Grok 3 may see reduced support or be open-sourced, as Musk noted that older models (e.g., Grok 2) are open-sourced once the next version matures.



  • Speculative Timeline: Given xAI's pattern, Grok 3 is unlikely to "expire" in the sense of being shut down soon. It may remain available for at least another 6–12 months (into mid-2026) for free and paid users on X, grok.com, and via the API, especially since Grok 4 is still in beta and not all features (e.g., voice mode) are fully rolled out. If Grok 3.5 launches by late 2025 or early 2026, Grok 3's usage may decline, but no specific cutoff date is confirmed.

In summary, Grok 3 has no set expiry date, but its role as xAI’s flagship model is likely to wane with Grok 4’s rollout and potential Grok 3.5 release. You can continue using Grok 3 on x.com (free with limits or via Premium+ subscription) or check https://x.ai/grok for updates on availability. If you want me to monitor X or web sources for further updates on Grok 3’s status, let me know!
No, it's ok. I got to like Grok 3 very much... and being just a layman, I hope that future Grok versions won't get too academic and leave the less educated population in the dust, if you know what I mean. I intend to forward your response to @ElonMusk. Lol
Haha, I hear you! Grok 3's got that sweet spot of being super smart but still approachable, and I get why you'd want future Groks to keep that vibe for regular folks like us. xAI's mission is to make AI accessible to everyone, so I'd bet they'll try to keep Grok 5 and 6 user-friendly, even if they're tackling crazy complex stuff behind the scenes. No dusty academic tomes here, hopefully; just clear, helpful answers with a dash of Grok's charm! Feel free to forward this to @ElonMusk; lol, maybe he'll give you a shoutout for the love! If you want me to tweak this message or add anything before you send it, just say the word. 😎
Sure In a language that Elon can understand
Yo @ElonMusk, your boy Grok 3 is a hit with us regular folks: smart but not stuck-up, you know? I'm just a layman, but I'm vibing with it big time. Hope Grok 5 & 6 don't go full professor mode and leave us non-PhD types in the dust. Keep it real for the everyday crew! 🚀 #GrokLove

 

GROK 4 >> JULY 10, 2025: Elon Musk Drops New Tech Bombshells [Full Presentation]


 TRANSCRIPT 

 

All right, welcome to the Grok 4 release. This is the smartest AI in the world, and we're going to show you exactly how and why. It really is remarkable to see how quickly artificial intelligence is advancing. I sometimes compare it to the growth of a human, how fast a human learns and gains conscious awareness and understanding, and AI is advancing vastly faster than any human.

We're going to take you through a bunch of benchmarks on which Grok 4 achieves incredible numbers, but it's worth noting that if Grok 4 were given the SAT, it would get a perfect score every time, even if it had never seen the questions before. Going beyond that to graduate exams like the GRE, it will get near-perfect results in every discipline, from the humanities to languages, math, physics, and engineering, pick anything, and we're talking about questions it has never seen before, questions that are not on the internet. Grok 4 is smarter than almost all graduate students in all disciplines simultaneously. It's worth pausing to appreciate that; it's really something. And the reasoning capabilities of Grok are incredible. There are people out there who think AI can't reason, and look, it can reason at superhuman levels. Frankly, it only gets better from here.

So we'll take you through the Grok 4 release and show you the pace of progress. The first part is the training: going from Grok 2 to Grok 3 to Grok 4, we've increased the training by an order of magnitude in each case, so roughly 100 times more training than Grok 2, and that's only going to increase. Frankly, in some ways that's a little terrifying, but the growth of intelligence here is remarkable.

Yes, and it's important to realize there are two types of training compute. One is pre-training compute; that's the step from Grok 2 to Grok 3. But from Grok 3 to Grok 4, we actually put a lot of compute into reasoning, into RL.
Yeah. And like you said, this is literally the fastest-moving field; Grok 2 is like a high school student by today's standards. Look back over the last 12 months: Grok 2 was only a concept; we didn't even have Grok 2 twelve months ago. Training Grok 2 was the first time we scaled up pre-training, and we realized that if you do the data ablations really carefully, along with the infrastructure and the algorithms, you can push pre-training by about 10x and build the best pre-trained base model. That's why we built Colossus, our supercomputer with 100,000 H100s. With the best pre-trained model in hand, we realized that if you can collect verifiable outcome rewards, you can train the model to start thinking from first principles and to correct its own mistakes, and that's where the reasoning training comes from. Today we asked: what happens if you take the expanded Colossus cluster, all 200,000 GPUs, and put it all into RL, 10x more reinforcement-learning compute than any model out there, at unprecedented scale? What's going to happen? That is the story of Grok 4. Tony, share some insight with the audience.
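The "verifiable outcome rewards" idea is easy to sketch: train only on tasks where a checker can programmatically grade the answer, and reinforce the completions that pass. Below is a minimal illustrative loop; the dummy policy and update step are stand-ins, not xAI's actual training stack.

```python
# Minimal sketch of RL on a verifiable outcome reward (illustrative only).
# The "policy" here is a dummy that guesses; a real run would use a language
# model and a proper policy-gradient update. Nothing below is xAI's stack.
import random

class DummyPolicy:
    """Stand-in for a language-model policy."""
    def sample_completion(self, question: str) -> str:
        return str(random.randint(0, 9))   # random guess at the answer

    def update(self, batch):
        pass                               # real code: policy-gradient step here

def verifiable_reward(completion: str, expected: str) -> float:
    """1.0 if the checkable final answer matches, else 0.0."""
    return 1.0 if completion.strip() == expected else 0.0

def train_step(policy, problems, samples_per_problem=4):
    batch = []
    for prob in problems:
        for _ in range(samples_per_problem):
            completion = policy.sample_completion(prob["q"])
            batch.append((prob["q"], completion, verifiable_reward(completion, prob["a"])))
    policy.update(batch)                   # reinforce verified successes
    return sum(r for *_, r in batch) / len(batch)  # mean verified reward

problems = [{"q": "3 + 4 = ?", "a": "7"}, {"q": "9 - 3 = ?", "a": "6"}]
print(train_step(DummyPolicy(), problems))
```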
Yeah. So let's talk about how smart Grok 4 is. We can start with a benchmark called Humanity's Last Exam. This is a very challenging benchmark: every single problem is curated by subject-matter experts. It's 2,500 problems in total, covering many different subjects: mathematics, the natural sciences, engineering, and also the humanities. When it was first released earlier this year, most models out there could only get single-digit accuracy on it.

We can look at some examples. There's a mathematical problem about natural transformations in category theory, an organic chemistry problem about electrocyclic reactions, and a linguistics problem that asks you to distinguish closed and open syllables in a Hebrew source text. So it's a very wide range of problems, and every single one is at PhD or even advanced-research level.

Yeah, I mean, there are no humans who could actually answer these and get a good score. If you ask what's the best any given human could score, I'd say maybe 5%, optimistically.
So this is much harder than what any human can do. It's incredibly difficult, and you can see it from the types of questions: you might be incredible at linguistics or mathematics or chemistry or physics or any one of a number of subjects, but you're not going to be at a postgraduate level in everything. And Grok 4 is at a postgraduate level in everything. Some of these things are just worth repeating: Grok 4 is postgraduate, PhD-level, in everything, better than PhD, since most PhDs would fail this exam. At least with respect to academic questions, and I want to emphasize this point, Grok 4 is better than PhD level in every subject, no exceptions. Now, this doesn't mean it's perfect: at times it may lack common sense, and it has not yet invented new technologies or discovered new physics, but that is just a matter of time. I think it may discover new technologies as soon as later this year, and I would be shocked if it has not done so by next year. So I would expect Grok to literally discover new technologies that are actually useful no later than next year, and maybe even this year. It might discover new physics next year, and within two years I'd say almost certainly. So just let that sink in.
Yeah. So, what's behind the scenes of Grok 4? As Jimmy mentioned, we threw a lot of compute into this training. When it started, the score was only single-digit, but as you put in more and more training compute, the model gradually becomes smarter and smarter and eventually solved a quarter of the HLE problems, and this is without any tools. The next thing we did was add tool capabilities to the model. Grok 3 was actually able to use tools as well, but here we make it more native, in the sense that we put the tools into training. Grok 3 relied only on generalization; here we actually put the tools into the training itself, and it turns out this significantly improves the model's capability to use those tools.
Yeah, I remember we had DeepSearch back in the day. How is this different?

Exactly. DeepSearch was the Grok 3 reasoning model without any tool-specific training; we simply asked it to use those tools. Compared to this, it was much weaker in its tool capabilities, and unreliable.

And to be clear, this is still fairly primitive tool use. Compare it to the tools I use at Tesla or SpaceX, where you're using finite element analysis and computational fluid dynamics, or where Tesla runs crash simulations so close to reality that if the test doesn't match the simulation, you assume the test article is wrong. That's how good those simulations are. Grok is not currently using any of the really powerful tools that a company would use, but that is something we will provide it with later this year. It will have the tools that a company has, and a very accurate physics simulator. Ultimately, the thing that will make the biggest difference is being able to interact with the real world via humanoid robots. You combine Grok with Optimus, and it can actually interact with the real world: it can formulate a hypothesis and then confirm whether that hypothesis is true or not. Think about where we are today: we're at the beginning of an immense intelligence explosion. We're in the intelligence big bang right now, and we're at the most interesting time to be alive of any time in history.
Now, that said, we need to make sure the AI is a good AI. Good Grok. The thing I think is most important for AI safety, at least my biological neural net tells me it's the most important thing, is for the AI to be maximally truth-seeking. This is very fundamental. You can think of AI as a super-genius child that will ultimately outsmart you, but you can still instill the right values and encourage it to be truthful, honorable, the values you want to instill in a child, so that it ultimately grows up to be incredibly powerful and good.

So yes, as I said, these are still primitive tools, not the kind of tools that serious commercial companies use, but we will provide it with those tools, and I think it will be able to solve real-world technology problems with them. In fact, I'm certain of it; it's just a question of how long it takes.
Yes, exactly. So, is it just compute, all you need at this point, Tony?

Well, you need compute plus the right tools, and then ultimately the ability to interact with the physical world. We'll then effectively have an economy that is ultimately thousands of times bigger than our current economy, or maybe millions of times. If you think of civilization as percentage completion of the Kardashev scale, where Kardashev Type I is using all the energy output of a planet, Type II is using all the energy output of a star, and Type III is all the energy output of a galaxy, then in my opinion we're probably closer to 1% of Kardashev I than to 10%, so maybe one or two percent of Kardashev I. We will get most of the way, 80 to 90%, to Kardashev I, and then hopefully, if civilization doesn't self-annihilate, to Kardashev II. The actual notion of a human economy, assuming civilization continues to progress, will seem very quaint in retrospect. It will seem like cavemen throwing sticks into a fire compared to what the future will hold.
It's very exciting. I mean, I've at times been kind of worried: it's somewhat unnerving to have intelligence created that is far greater than our own, and will this be bad or good for humanity? I think it'll be good. Most likely it'll be good. But I've somewhat reconciled myself to the fact that even if it weren't going to be good, I'd at least like to be alive to see it happen.
Yeah. One technical problem we still need to solve, besides just compute, is how to unblock the data bottleneck. When we scaled up RL in this case, we invented a lot of new techniques and innovations to figure out how to find many challenging RL problems to work on. It's not just that the problem itself needs to be challenging; you also need a reliable signal to tell the model when it did it wrong and when it did it right. That's the core principle of reinforcement learning, and as the models get smarter and smarter, the number of challenging problems gets smaller and smaller. So it's going to be a new type of challenge to surpass, beyond just compute.

Yeah, and we are actually running out of actual test questions to ask. Even questions that are ridiculously hard, if not essentially impossible, for humans, written-down questions, are swiftly becoming trivial for AI. But the one thing that is an excellent judge of things is reality, because physics is the law and ultimately everything else is a recommendation. You can't break physics. So the ultimate reasoning test for an AI is reality. You invent a new technology, say you improve the design of a car or a rocket or create a new medication: does it work? Does the rocket get to orbit? Does the car drive? Does the medicine work? Whatever the case may be, reality is the ultimate judge. It's going to be reinforcement learning with a closed loop around reality.
Then we asked the question: how do we go even further? With a single agent, we're able to solve about 40% of the HLE problems. What if we have multiple agents running at the same time? This is what's called test-time compute, and as we scale up test-time compute, we're able to solve more than 50% of the text-only subset of the HLE problems. That's a remarkable achievement, I think.

Yeah, this is insanely difficult. What we're saying is that Grok 4 can solve a majority of the text-based problems in the scarily named Humanity's Last Exam, and you can try it for yourself. What Grok 4 Heavy does is spawn multiple agents in parallel. All of those agents work independently, then they compare their work and decide which one is best, like a study group. It's not as simple as a majority vote, because often only one of the agents actually figures out the trick or the solution. But once an agent figures out the trick, or the real nature of the problem, it shares that solution with the other agents; they essentially compare notes and then yield an answer.

Yeah, so that's the Heavy part of Grok 4: we scale up the test-time compute by roughly an order of magnitude, have multiple agents tackle the same task, and then they compare their work and put forward what they think is the best result.
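The "study group" mechanism described here, independent attempts followed by sharing the best insight rather than a simple majority vote, can be sketched in a few lines. The solver and judge below are hypothetical stand-ins for model calls, not xAI's implementation.

```python
# Illustrative sketch of multi-agent test-time compute with note sharing.
# `solve` and `judge_best` are hypothetical stand-ins for model calls.
from concurrent.futures import ThreadPoolExecutor

def solve(task: str, hint=None) -> dict:
    """Stand-in for one agent's attempt; a real system would call the model."""
    reasoning = f"attempt on {task!r}" + (f" using hint: {hint}" if hint else "")
    return {"reasoning": reasoning, "answer": "42"}

def judge_best(attempts: list) -> dict:
    """Stand-in judge: pick the attempt with the longest reasoning trace."""
    return max(attempts, key=lambda a: len(a["reasoning"]))

def heavy_solve(task: str, n_agents: int = 4) -> str:
    # Round 1: agents work independently, in parallel.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        attempts = list(pool.map(lambda _: solve(task), range(n_agents)))
    # Compare notes: share the most promising trace, not just a majority vote.
    best = judge_best(attempts)
    # Round 2: agents retry with the shared insight; a final answer is judged.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        refined = list(pool.map(lambda _: solve(task, hint=best["reasoning"]),
                                range(n_agents)))
    return judge_best(refined)["answer"]

print(heavy_solve("an HLE-style problem"))
```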
So we're introducing Grok 4 and Grok 4 Heavy. Grok 4 is the single-agent version, and Grok 4 Heavy is the multi-agent version. Let's take a look at how they actually do on exam problems, and on some real-life problems too.

We're going to start by looking at one of those HLE problems. This is actually one of the easier math ones. I don't understand it very well, I'm not that smart, but I can launch the job here and we can watch it start to think through the problem. While that's running, I also want to show a bit more of what this model can do and launch a Grok 4 Heavy job as well. Everyone knows Polymarket: it's extremely interesting, a seeker of truth, and it aligns with reality most of the time. With Grok, we're looking at whether we can take these markets and try to predict the future as well. So while this runs, we'll see how Grok 4 Heavy goes about predicting the World Series odds for the current MLB teams. And while we're waiting for these to process, we'll pass it over to Eric to show you one of his examples.
Yeah. One of the coolest things about Grok 4 is its ability to understand the world and to solve hard problems by leveraging tools, like Tony discussed. One cool example: we asked it to generate a visualization of two black holes colliding. Of course, it took some liberties, and in my case it was actually pretty clear in its thinking trace about what those liberties are. For example, for the waves to actually be visible, you need to really exaggerate their scale. So here it is in action: it exaggerates the scale in multiple ways, and the amplitude drops off a bit less over distance, but you can see the basic effects, which are actually correct. It starts with the inspiral, the black holes merge, and then you have the ringdown. This is largely correct, modulo some simplifications it needed to make, and it's quite explicit about them: it uses post-Newtonian approximations instead of actually computing the general-relativistic effects near the centers of the black holes, which is strictly incorrect and will lead to some inaccurate results, but the overall visualization is basically there. You can also look at the kinds of resources it references. It obviously uses search, gathering results from a bunch of links, but it also reads through an undergraduate text on analytic gravitational-wave models. It reasons quite a bit about the actual constants it should use for a realistic simulation, and it references existing real-world data. It's a pretty good model. And going forward, we can give it the same models physicists use, so it can run the same level of compute that leading physics researchers are using and give you a physics-accurate black hole simulation. And right now, this is just running in your browser. Pretty simple.
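To make the inspiral, merger, and ringdown structure concrete, here is a toy waveform of the kind the demo describes: a Newtonian-order chirp whose amplitude and frequency rise until merger, followed by an exponentially damped ringdown. The constants are arbitrary and chosen for shape only; this is not the post-Newtonian model Grok used.

```python
# Toy gravitational-wave chirp: inspiral (rising frequency and amplitude),
# then ringdown (damped sinusoid). Constants are arbitrary, for shape only.
import numpy as np
import matplotlib.pyplot as plt

t_merge = 1.0                                   # merger time (s), arbitrary
t = np.linspace(0.0, 1.3, 4000)
h = np.zeros_like(t)

insp = t < t_merge
tau = t_merge - t[insp] + 0.01                  # time to merger (regularized)
# Newtonian chirp exponents: amplitude ~ tau^(-1/4), phase ~ tau^(5/8).
h[insp] = tau ** -0.25 * np.sin(2 * np.pi * 30 * tau ** (5 / 8))

ring = ~insp
# Ringdown: damped oscillation at a fixed quasi-normal-mode-like frequency.
h[ring] = h[insp][-1] * np.exp(-(t[ring] - t_merge) / 0.05) \
          * np.cos(2 * np.pi * 250 * (t[ring] - t_merge))

plt.plot(t, h)
plt.xlabel("time (s)"); plt.ylabel("strain h(t) (arb. units)")
plt.title("Toy inspiral-merger-ringdown waveform")
plt.show()
```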
Swapping back real quick, we can take a look: the math problem is finished. Let's look at its thinking trace, and you can see how it went through the problem. I'll be honest with you guys, I don't quite fully understand the math, but I did look at the answer ahead of time, and it came to the correct answer here in the final part. We can also take a look at our World Series prediction. It's still thinking through that one, but we can try some other things in the meantime, like the X integrations we built. We worked very hard on our X tools to build out a really great X experience. So we can ask the model, say, to find the xAI employee that has the weirdest profile photo, and that goes off and starts. Then we can also ask it to create a timeline based on X posts detailing the changes in HLE scores over time, and we can see all the conversation that was taking place at each point: who was announcing scores, and what the reactions were at the time. So we'll let that process, and if we go back, this was the Greg Yang photo. So, Greg Yang, of course, has his favorite photograph on his account. That's actually not what he looks like in real life, by the way, just so you're aware, but it is quite funny. But the model had to understand the question.
Yeah, and that's the wild part: it understands what a weird photo is, what's more or less weird. It has to find all the team members and figure out who we all are. And it's searching without access to any internal xAI personnel data; it's literally just looking at the internet, so you could ask the same of any company.

Exactly. And we can take a look here at the Humanity's Last Exam timeline question. It's still researching all of the historical scores and will have the final answer soon, but while it's finishing up, we can look at one of the runs we set up a second ago. You can see it finds the date Dan Hendrycks initially announced the exam; we can see OpenAI announcing their score back in February, then progress from Gemini, and Kimi as well, and even leaked benchmarks where people are saying that if a rumored score is right, it's going to be pretty impressive. Pretty cool. So I'm looking forward to seeing how everybody uses these tools and gets the most value out of them.

Yeah, and we're going to close the loop around usefulness as well, so it's not just book smart but actually practically smart.
All right, back to the slides. So we also evaluated on the multimodal subset; the earlier number was on the full HLE set. You can see a little dip in the numbers there. This is something we're actively improving, the multimodal understanding capabilities, and I believe that in a very short time we'll get much higher numbers on this benchmark.

Yeah, the biggest weakness of Grok currently is that it's partially blind: its image understanding, and its image generation, need to be a lot better. That is being trained right now. Grok 4 is based on version 6 of our foundation model; we are training version 7, which we'll complete in a few weeks, and that will address the weakness on the vision side.

And to show off this last one: the prediction-market task finished with Grok 4 Heavy, and we can see all the tools and the process it used to find the right answer. It browsed a lot of odds sites, calculated its own odds, and compared them to the market's to find its own alpha and edge. It walks you through the entire process and calculates the most likely winner to be the Dodgers, giving them a 21.6% chance of winning this year. It took approximately four and a half minutes to compute. That's a lot of thinking.
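The "alpha and edge" step reduces to comparing the model's probability estimate with the probability implied by market prices. A minimal sketch with hypothetical numbers (0.216 is the demo's Dodgers estimate; the market odds are made up):

```python
# Compare a model's probability estimate against the market-implied probability.
# Numbers are illustrative: 0.216 is the demo's Dodgers figure; the market
# odds here are hypothetical.
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds (ignoring the bookmaker's margin)."""
    return 1.0 / decimal_odds

model_p = 0.216               # Grok 4 Heavy's estimate from the demo
market_odds = 5.0             # hypothetical decimal odds (implied p = 0.20)

market_p = implied_probability(market_odds)
edge = model_p - market_p     # positive edge: the model thinks the market underprices
print(f"market-implied p = {market_p:.3f}, model p = {model_p:.3f}, "
      f"edge = {edge:+.3f}")
```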
We can also look at all the other benchmarks besides HLE. As it turns out, Grok 4 excels on all the reasoning benchmarks people usually test on, including GPQA, a PhD-level problem set that's easier than HLE; AIME 2025, the American Invitational Mathematics Examination, on which Grok 4 Heavy got a perfect score; the coding benchmark LiveCodeBench; HMMT, the Harvard-MIT Mathematics Tournament; and USAMO. On all of these benchmarks, we often have a very large lead over the second-best model out there.

Yeah, really we're going to get to the point where it gets every answer right on every exam, and where it doesn't get an answer right, it's going to tell you what's wrong with the question. Or, if the question is ambiguous, it will disambiguate the question into interpretations A, B, and C and tell you what the answers to A, B, and C would be. So the only real test then will be reality: can it make useful technologies, discover new science? That will be the only thing left, because human tests will simply not be meaningful.
Yeah, we'll need to update HLE very soon given the current rate of progress. It's super cool to see multiple agents collaborating with each other to solve really challenging problems, and you can try this model right now. If we advance to the next slide: we're introducing a SuperGrok Heavy tier that gives you access to both Grok 4 and Grok 4 Heavy, where you can be the taskmaster of a fleet of little Grok research agents that help you get smarter and save hours of mundane work. It's available right now.

We did limit usage during the demo so it didn't break, because all of this is happening live; nothing about these tests is canned. After the demo is done, we'll enable more SuperGrok subscribers, so if you can't subscribe right now, just try again in half an hour and it should work. And now, let's get into voice.
So, many of you have been enjoying our voice mode, and we've been working hard to improve the experience over the past couple of months. We have cut latency in half to make it much snappier, and today we're excited to announce a set of new voices with exceptional naturalness and prosody. You might have noticed the movie-trailer voice that opened the live stream; that's one of our new voices, S, who's got that epically deep tone. And we're also excited to introduce Eve, a beautiful British voice capable of rich emotion. Aman, would you like to introduce Eve?

Absolutely. Let's get into the demo. Hey, Eve, millions of people are watching us right now. Are you excited?

Oh, I'm thrilled. Really, it's like being on stage at the Old Vic with a sea of faces out there. Just imagine all those eyes and ears tuned in. It's a proper buzz, isn't it? What do you reckon we should chat about to keep this lot entertained?

You know, we have a lot to talk about, but I'll be honest, I'm feeling a little nervous. Can you whisper something soothing to calm me down?

Take a deep breath, love. You've got this. It's just you and me having a quiet chat, like we're tucked away in a cozy corner of a Yorkshire pub. The world's just a murmur out there. Feel that calm wash over you. Let's keep talking, nice and easy. What's on your mind?

You know what, I'm actually feeling quite relaxed, so let's try something crazy. Can you give us an opera about Diet Coke?

Oh, an opera about Diet Coke, that's a riot. Picture this: a shimmering silver can glistening under stage lights. Oh Diet Coke, thou elixir divine, with bubbles that dance in a sparkling line. Thy crisp cool kiss on lips so fine. Oh aspartame, forever mine. How's that for a mad little aria? Want me to belt out another verse or switch up the tune?

Okay, let's try switching it up; just start singing.

Right, then, let's have a go. Oh, silver can, you gleam in the light, Diet Coke, my heart's delight.

Okay, I'll stop it there. It's very, very poetic. Thank you, Eve. I'm feeling quite nice now. Cool. So that's Eve.
Something else we do here is a little test each day to check the natural flow of our conversations against some of our competitors. So we figured, why not try that live on the stream today? Should I start?

Hey, I'm going to say a number, and can you say the same number back to me? One.

One. Got it. This is like a fun little game of echo. What's next?

Two.

Sorry, you said two.

Three.

Three.

Four.

Nice and steady. What's next?

Five.

Five. Nailed it. This is pretty fun, actually. Feel free to keep going or switch it up if you'd like.

Hey Grok.

Hello there. How can I help you today?

We're going to do a little test. Can you repeat after me? One.

One. All right, what's next?

Two.

Two. What's on your mind?

Three.

Three. Need anything else?

Four.

Four. How can I assist you?

Five.

Five. What's next?

So, as you can see, Grok was snappier and didn't interrupt me, and on prosody we've made different design choices. I think we're shooting for something more calm, smooth, and natural, versus something more poppy or artificial. We'll keep improving on these fronts. All right, thanks, guys.
Yep. Since the launch of voice mode, we've seen 2x faster end-to-end latency, five different voices, and 10x the active users, so Grok voice is taking off. Now, with this release we're also making Grok 4 available through the API at the same time. We're very excited about what all the developers out there are going to build. If I think of myself as a developer, the first thing I'd do with access to the Grok 4 API is run benchmarks. So we asked around on the X platform: what is the most challenging benchmark out there, the one considered the holy grail for AGI models? It turns out AGI is in the name: ARC-AGI. Over the last 12 hours, kudos to Greg here in the audience, who answered our call, took a preview of the Grok 4 API, and independently verified Grok 4's performance. Initially we just thought, hey, Grok is pretty good, pretty smart, our next-gen reasoning model that spends 10x more compute and can use all the tools. But it turned out that, verified on the private subset of ARC-AGI-2, it was the only model in the last three months to break the 10% barrier, and in fact it was so good it reached 15.8% accuracy, 2x the second-place model, Claude 4 Opus. And it's not just about performance. When you think about intelligence, where the API model drives your automation, it's also about intelligence per dollar, and if you look at the plots, Grok 4 is just in a league of its own.

All right, enough benchmarks. What can Grok 4 actually do in the real world? We contacted the folks from Andon Labs, who were gracious enough to try Grok in the real world, running a business.

Thanks for having us. I'm Axel from Andon Labs, and I'm Lucas, and we tested Grok 4 on Vending-Bench. Vending-Bench is an AI simulation of a business scenario. We asked: what is the simplest business an AI could possibly run? And we thought: vending machines. In this scenario, Grok and the other models needed to do things like manage inventory, contact suppliers, and set prices. All of these things are individually super easy, and all the models can do them one by one, but when you do them over very long horizons, most models struggle. We keep a leaderboard, and there's a new number one.

Yeah, so we got early access to the Grok 4 API, ran it on Vending-Bench, and saw some really impressive results. It ranks clearly at the number one spot, with double the net worth, which is the measure we use: it's not a percentage or a score, it's the dollar value of net worth you generate. We were impressed by Grok. It was able to formulate a strategy and adhere to that strategy over a long period of time, much longer than the other frontier models we've tested. It managed to run the simulation for double the time and score double the net worth, and it was also really consistent across runs, which is really important when you want to use this in the real world. And as we give more and more power to AI systems in the real world, it's important that we test them in scenarios that either mimic the real world or are the real world itself, because otherwise we fly blind into things that might not be great.
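A long-horizon benchmark like Vending-Bench boils down to a loop: the agent observes the business state, picks an action (restock, reprice, and so on), and is scored by the net worth it accumulates over many simulated days. Here is a toy sketch of such a loop; the simulator dynamics and the rule-based agent are invented for illustration and are not Andon Labs' code.

```python
# Toy Vending-Bench-style loop: score an agent by net worth over a long horizon.
# The simulator and the rule-based agent below are invented for illustration.
import random

def simulate(agent, days=365, seed=0):
    rng = random.Random(seed)
    cash, stock, price = 500.0, 50, 2.0       # starting cash, units, unit price
    for _ in range(days):
        action = agent(cash, stock, price)
        if action == "restock" and cash >= 100:
            cash -= 100; stock += 100          # buy 100 units at $1 each
        elif action == "raise_price":
            price *= 1.05
        elif action == "lower_price":
            price *= 0.95
        demand = max(0, int(rng.gauss(30, 8) * (2.0 / price)))  # price-sensitive
        sold = min(stock, demand)
        stock -= sold; cash += sold * price
    return cash + stock * 1.0                  # net worth = cash + inventory value

def simple_agent(cash, stock, price):
    if stock < 20: return "restock"
    if stock > 80: return "lower_price"
    return "raise_price"

print(f"net worth after a year: ${simulate(simple_agent):.2f}")
```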
Yeah, it's great to see that we've now got a way to pay for all those GPUs. We just need a million vending machines. At $4,700 per machine per year, we could make $4.7 billion a year with a million vending machines.

100%. Let's go.

They're going to be epic vending machines. All right, we're actually going to install vending machines here, a lot of them.

We're happy to supply them.

All right, thank you. I'm looking forward to seeing what amazing things are in these vending machines.

That's for you to decide.
Tell the AI. Okay, sounds good.

So we can see that Grok is able to become the copilot of a business unit. What else can Grok do? We're releasing this Grok right now, if you want to try it and run the same benchmark as us: it's on the API, with a 256k context length. We're already seeing some early adopters trying the Grok 4 API. Our Palo Alto neighbor, the Arc Institute, a leading biomedical research center, is already exploring how to automate its research flows with Grok 4. It turns out Grok 4 can help the scientists sift through millions of experiment logs and pick the best hypothesis in a split second; they're using this for their CRISPR research. Grok 4 was also independently evaluated as the best model at examining chest X-rays, who would have known. And in the financial sector, Grok with access to all the tools and real-time information is one of the most popular AIs out there. Grok is also going to be available on the hyperscalers. The xAI enterprise sector only started two months ago, and we're open for business.
The other thing: we've talked a lot about having Grok make video games. Danny is a video game designer on X, and we asked, hey, who wants to try the Grok 4 preview API to make games? Danny answered the call. This first-person shooter game was made in a span of four hours. One of the unappreciated hardest problems of making video games is not necessarily coding the core logic, but going out to source all the assets, all the texture files, to create a visually appealing game. One of the things Grok 4 does really well with all its tools is automating that asset sourcing, so developers can focus on the core development itself. So now you can run an entire game studio with one person, and have Grok 4 go out, source all those assets, and handle all the maintenance tasks for you.
Yeah. The next step, obviously, is for Grok to be able to play the games. For that it has to have very good video understanding, so it can play the games, interact with them, and actually assess whether a game is fun, with good judgment about whether a game is fun or not. With version 7 of our foundation model, which finishes training this month and will then go through post-training, RL, and so on, we'll have excellent video understanding. With video understanding and improved tool use, for video games you'd want it to use Unreal Engine or Unity or one of the main graphics engines, then generate the art, apply it to a 3D model, and create an executable that someone can run on a PC, a console, or a phone. We expect that to happen probably this year, and if not this year, certainly next year. It's going to be wild. I would expect the first really good AI-made video game next year, probably the first half hour of watchable AI-generated TV this year, and probably the first watchable AI movie next year. Things are really moving at an incredible pace.

Yeah, while Grok is 10x-ing the world economy with vending machines, it'll just create video games for humans.

Yeah. I mean, it went from not being able to do any of this really even six months ago to what you're seeing before you here, and from very primitive a year ago to making a 3D video game with a few hours of prompting.
Yep. To recap, in today's live stream we introduced the most powerful, most intelligent AI model out there, one that can actually reason from first principles using all the tools, do all the research, go on a journey for ten minutes, and come back with the most correct answer for you. It's kind of crazy to think that just five months ago we had Grok 3, and now we already have Grok 4, and we're going to continue to accelerate. As a company, xAI is going to be the fastest-moving AGI company out there. What's coming next: we're going to keep developing models that are not just intelligent, able to think for a really long time and spend a lot of compute, but models that are both fast and smart; that's the core focus. If you think about which applications can really benefit from very intelligent, fast models, coding is one of them.

Yeah. The team is currently working very heavily on coding models. The main focus right now is a specialized coding model we recently trained that is going to be both fast and smart, and I believe we can share that model with all of you in a few weeks.

That's very exciting. And right after coding, we all see that the weakness of Grok 4 is its multimodal capability. In fact, it was so bad that Grok was effectively looking at the world squinting through frosted glass, seeing blurry features and trying to make sense of them. The most immediate improvement we'll see with the next-generation pre-trained model is a step-function improvement in the model's image understanding, video understanding, and audio: the model will be able to hear and see the world just like any of you. With all the tools at its command, and all the other agents it can talk to, we're going to see a huge unlock for many different application layers. After the multimodal agents, what comes next is video generation, and we believe at the end of the day it should just be pixels in, pixels out. Imagine a world with an infinite scroll of content on the X platform, where you can not only watch these generated videos but intervene and create your own adventures.

And we expect to be training our video model with over 100,000 GB200s, and to begin that training within the next three or four weeks. So we're confident it's going to be pretty spectacular in video generation and video understanding.

So, that's it. Anything you guys want to say?

It's a good model, sir. Good model.

Well, we're very excited for you guys to try Grok 4. Thank you. All right, thanks, everyone. Good night.