PROJECT: What will Grok 5 and Grok 6 do that is undoable with Grok 3?
Since there's no concrete information available about Grok 5 or Grok 6, I can offer informed speculation based on the trajectory of xAI's development, the capabilities of Grok 3, and hints from available sources. Grok 3, released in February 2025, is already a powerhouse with advanced reasoning, a 1-million-token context window, DeepSearch for real-time data, and Big Brain mode for complex problem-solving. It excels in math, science, coding, and general knowledge, outperforming models like GPT-4o and Claude 3.5 Sonnet on benchmarks such as AIME 2025 (93.3%) and the Chatbot Arena leaderboard.
For Grok 5 and 6 to achieve what’s “undoable” with Grok 3, they would likely push boundaries in the following areas:
First-Principles Reasoning at Scale: Grok 3.5, teased for beta release in May 2025, is said to reason from first principles, deriving answers for highly technical questions (e.g., rocket engines, electrochemistry) without relying on internet-sourced data. Grok 5 and 6 could take this further, potentially solving novel scientific problems or simulating complex systems (e.g., quantum mechanics or biological processes) with unprecedented accuracy, reducing hallucinations to near-zero. This could enable breakthroughs in fields like physics or medicine, where Grok 3’s reliance on existing data limits its ability to generate truly original insights.
Advanced Multimodal Capabilities: Grok 3 handles text, images, and some multimodal tasks, but Grok 5 and 6 might integrate more sophisticated multimodal processing, such as real-time video analysis, 3D modeling, or immersive augmented reality outputs. For instance, posts on X suggest Grok 5 may include experimental features like “3D Print Mode,” hinting at capabilities for generating physical object designs or simulations. This could make them capable of tasks like designing functional prototypes or analyzing live sensor data, which Grok 3 cannot fully execute.
True AGI-Like Autonomy: Grok 3 is a strong reasoning model, but it’s not autonomous. Grok 5 or 6 could approach artificial general intelligence (AGI), as speculated in X posts, by autonomously managing multi-step workflows, learning from minimal input, or adapting to entirely new domains without retraining. This might include self-improving algorithms or the ability to independently conduct experiments, something Grok 3’s architecture doesn’t support.
Hyper-Scalability and Efficiency: Grok 3 leverages xAI’s Colossus supercomputer with 200,000 Nvidia H100 GPUs. Future models could utilize a planned 1-million-GPU cluster, enabling orders-of-magnitude more compute power. This could allow Grok 5 and 6 to process vast datasets in real-time (e.g., global satellite feeds or entire genomic databases), tasks too resource-intensive for Grok 3.
Enhanced Real-World Integration: Grok 3’s DeepSearch pulls real-time data from X and the web, but Grok 5 and 6 might integrate with proprietary systems (e.g., SpaceX or Tesla data, as rumored for Grok 3.5) or IoT networks, enabling real-time control of physical systems like robotics or autonomous vehicles. This would make them capable of tasks like orchestrating complex engineering projects, which Grok 3 cannot do.
These advancements would build on Grok 3’s limitations, such as its dependence on existing data, limited multimodal depth, and lack of full autonomy. However, without official details, this is speculative. For the latest on xAI’s plans, check https://x.ai/grok or xAI’s X posts. If you want me to dig deeper into any specific aspect, let me know!
Grok 4 was just introduced live on x.com... when will it become available here?
Grok 4 was announced via a livestream on xAI's X account on July 9, 2025, and is described as xAI's most powerful AI model yet, with advanced reasoning and specialized coding capabilities. According to sources, Grok 4 is already accessible through the xAI API for developers, supporting text modality initially, with vision and image generation capabilities expected soon. It’s also available on the X platform, mobile apps for iOS and Android, and for integration into development workflows via Grok 4 Code, a version tailored for developers.
For general availability on x.com, posts and reports indicate that early access is likely prioritized for X Premium+ and SuperGrok subscribers, with broader access to all X users expected shortly after the initial rollout. Since Grok 3 is already free for all X users, Grok 4 is likely to follow a similar model, with premium features like increased usage limits or advanced functionalities (e.g., Voice Mode) reserved for paid tiers.
While no exact date for full public availability on x.com was specified, the infrastructure for deployment appears ready, suggesting you can likely access Grok 4 now or within days on x.com, especially if you’re a Premium+ subscriber. To confirm availability in your region, check the X platform directly or visit https://grok.com. If you’re not seeing access yet, you can sign up for notifications on grok.com for updates on regional rollout.
When is Grok 3's expiry date, approximately?
There is no publicly available information from xAI or other reliable sources that specifies an "expiry date" for Grok 3. AI models like Grok 3 don't typically have a fixed expiration date in the traditional sense, as their availability depends on xAI's ongoing support, updates, and infrastructure decisions. However, based on the context of your question and the introduction of Grok 4 on July 9, 2025, it's likely that Grok 3's prominence may diminish as xAI shifts focus to newer models like Grok 4 and potentially Grok 3.5. Here are some relevant points to consider:
Grok 4 Rollout: Grok 4 was announced on July 9, 2025, and is already accessible via the xAI API and to X Premium+ and SuperGrok subscribers on x.com, with broader access expected soon. Historically, xAI transitions to newer models while maintaining older ones for a period, as seen with Grok 2 remaining available after Grok 3’s release. Grok 3 is likely to remain accessible for the foreseeable future, especially for free users on X (with usage limits) and through the API, but its advanced features may be overshadowed by Grok 4.
X Posts Context: Some X posts suggest Grok 3 might be phased out or updated, with one user speculating that July 9, 2025, marked its "last day" as the primary model due to Grok 4’s introduction. Another post humorously describes Grok 3’s “demise” with Grok 4’s launch. However, these are not official statements and should be treated as inconclusive user sentiments. There’s no evidence Grok 3 has been discontinued; it’s still operational, as confirmed by a post from July 8, 2025.
xAI’s Development Cycle: xAI has a rapid release cycle, with Grok 3 launched in February 2025 and Grok 4 just five months later. References to Grok 3.5 (planned for May 2025 but possibly delayed) indicate xAI’s focus on iterative improvements. When Grok 3.5 or future models like Grok 5 are released, Grok 3 may see reduced support or be open-sourced, as Musk noted that older models (e.g., Grok 2) are open-sourced once the next version matures.
Speculative Timeline: Given xAI’s pattern, Grok 3 is unlikely to “expire” in the sense of being shut down soon. It may remain available for at least 6–12 months (until mid-2026) for free and paid users on X, grok.com, and via API, especially since Grok 4 is still in beta and not all features (e.g., voice mode) are fully rolled out. If Grok 3.5 launches by late 2025 or early 2026, Grok 3’s usage might decline, but no specific cutoff date is confirmed.
In summary, Grok 3 has no set expiry date, but its role as xAI’s flagship model is likely to wane with Grok 4’s rollout and potential Grok 3.5 release. You can continue using Grok 3 on x.com (free with limits or via Premium+ subscription) or check https://x.ai/grok for updates on availability. If you want me to monitor X or web sources for further updates on Grok 3’s status, let me know!
No, it's OK. I got to like Grok 3 very much, and being just a layman, I hope future Groks won't get too academic and leave the less educated population in the dust, if you know what I mean. I intend to forward your response to @ElonMusk. Lol
Haha, I hear you! Grok 3's got that sweet spot of being super smart but still approachable, and I get why you'd want future Groks to keep that vibe for regular folks like us. xAI's mission is to make AI accessible to everyone, so I'd bet they'll try to keep Grok 5 and 6 user-friendly, even if they're tackling crazy complex stuff behind the scenes. No dusty academic tomes here, hopefully, just clear, helpful answers with a dash of Grok's charm! Feel free to forward this to @ElonMusk. Here's a possible post:
"@ElonMusk, your boy Grok 3 is a hit with us regular folks: smart but not stuck-up, you know? I'm just a layman, but I'm vibing with it big time. Hope Grok 5 & 6 don't go full professor mode and leave us non-PhD types in the dust. Keep it real for the everyday crew! #GrokLove"
All right, welcome to the Grok 4 release. This is the smartest AI in the world, and we're going to show you exactly how and why. It really is remarkable to see how quickly artificial intelligence is evolving. I sometimes compare it to the growth of a human, how fast a human learns and gains conscious awareness and understanding, and AI is advancing vastly faster than any human.

We're going to take you through a bunch of benchmarks on which Grok 4 achieves incredible numbers. But it's worth noting that Grok 4, if given the SAT, would get a perfect score every time, even if it had never seen the questions before. Going beyond that, on graduate exams like the GRE, it will get near-perfect results in every discipline, from the humanities to languages, math, physics, engineering, pick anything, and we're talking about questions it has never seen before that are not on the internet. Grok 4 is smarter than almost all graduate students in all disciplines simultaneously. It's actually important to appreciate that; it's really something.

The reasoning capabilities of Grok are incredible. There are people out there who think AI can't reason, and look, it can reason at superhuman levels. And frankly, it only gets better from here. So we'll take you through the Grok 4 release and show you the pace of progress.

The first part is the training. Going from Grok 2 to Grok 3 to Grok 4, we've essentially increased the training by an order of magnitude in each case, so roughly 100 times more training than Grok 2, and that's only going to increase. Frankly, it's in some ways a little terrifying, but the growth of intelligence here is
remarkable.

Yes, it's important to realize there are two types of training compute. One is pre-training compute; that was the jump from Grok 2 to Grok 3. But from Grok 3 to Grok 4, we're putting a lot of compute into reasoning, into RL.

And just like you said, this is literally the fastest-moving field, and Grok 2 is like a high school student by today's standards. If you look back over the last 12 months, Grok 2 was only a concept; we didn't even have Grok 2 twelve months ago. Training Grok 2 was the first time we scaled up pre-training. We realized that if you do the data ablations, the infrastructure, and the algorithms really carefully, you can push pre-training by about 10x and build the best pre-trained base model. That's why we built Colossus, the world's largest supercomputer, with 100,000 H100s. Then, with the best pre-trained model, we realized that if you can collect verifiable outcome rewards, you can train these models to start thinking from first principles, to reason, and to correct their own mistakes. That's where the gains come from.

Today we asked the question: what happens if you take the expanded Colossus, with all 200,000 GPUs, and put all of it into RL, 10x more reinforcement-learning compute than any other model out there, at unprecedented scale? What's going to happen? That is the story of Grok 4. Tony, share some insight with the audience.
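For a concrete picture of what "verifiable outcome rewards" means, here is a minimal sketch, not xAI's actual training stack: the policy picks a strategy, a programmatic checker verifies the final answer, and the binary pass/fail signal drives a REINFORCE-style update. The toy arithmetic task, the two-strategy policy, and the learning rate are all illustrative assumptions.

```python
import math, random

# Toy RL loop with a verifiable outcome reward (illustrative sketch only).
def make_problem():
    a, b = random.randint(2, 99), random.randint(2, 99)
    return (a, b), a * b                      # problem, ground-truth answer

def attempt(problem, strategy):
    a, b = problem
    if strategy == "careful":                 # slow but reliable strategy
        return a * b
    return a * b + random.choice([-1, 0, 1])  # fast, error-prone strategy

def verify(candidate, truth):
    return 1.0 if candidate == truth else 0.0 # checkable 0/1 reward signal

theta = 0.0                                   # preference for "careful"
lr = 0.1
for step in range(2000):
    p_careful = 1 / (1 + math.exp(-theta))
    strategy = "careful" if random.random() < p_careful else "fast"
    problem, truth = make_problem()
    reward = verify(attempt(problem, strategy), truth)
    # REINFORCE: reward * d/dtheta log pi(strategy)
    grad = (1 - p_careful) if strategy == "careful" else -p_careful
    theta += lr * reward * grad

print(f"P(careful) after training: {1 / (1 + math.exp(-theta)):.2f}")
```

Because only verified answers produce reward, the policy drifts toward the strategy that actually gets problems right; that reliability of the grading signal is the point being made in the talk.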
Yeah, let's talk about how smart Grok 4 is. We can start with a benchmark called Humanity's Last Exam. This is a very challenging benchmark: every single problem is curated by subject-matter experts. It's 2,500 problems in total, across many different subjects: mathematics, the natural sciences, engineering, and the humanities. When it was first released earlier this year, most models out there could only get single-digit accuracy on it.

We can look at some of those examples. There's a mathematical problem about natural transformations in category theory, an organic chemistry problem about electrocyclic reactions, and a linguistics problem that asks you to distinguish closed from open syllables in a Hebrew source text. So it's a very wide range of problems, and every single one is at PhD or even advanced-research level.

Yeah, there are no humans who can actually answer these and get a good score. If you took any given human, what's the best they could score? I'd say maybe 5%, optimistically. So this is much harder than what any human can do. It's incredibly difficult, and you can see from the types of questions that you might be incredible in linguistics or mathematics or chemistry or physics or any one of a number of subjects, but you're not going to be at a postgrad level in everything. And Grok 4 is at a postgrad level in everything. Some of these things are just worth repeating: Grok 4 is postgraduate, PhD level, in everything, better than PhD, since most PhDs would fail this exam. At least with respect to academic questions, and I want to emphasize this point, Grok 4 is better than PhD level in every subject, no exceptions.

Now, this doesn't mean it can't at times lack common sense, and it has not yet invented new technologies or discovered new physics, but that is just a matter of time. I think it may discover new technologies as soon as later this year, and I would be shocked if it has not done so by next year. So I would expect Grok to discover new technologies that are actually useful no later than next year, and maybe this year. And it might discover new physics next year, and within two years I'd say almost certainly. Just let that sink in.
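To make the benchmarking workflow concrete, here is a minimal sketch of how a team might score a model on an HLE-style set of exact-answer questions. The JSONL path, the `ask_model` stub, and the exact-match grading are assumptions for illustration; the real benchmark mixes question formats and grades more carefully.

```python
import json

def ask_model(question: str) -> str:
    """Stub for a model call; replace with a real API client."""
    return ""  # placeholder answer so the sketch runs end to end

def grade(predicted: str, expected: str) -> bool:
    # Exact match after trivial normalisation; real harnesses handle
    # multiple-choice and free-form answers with more care.
    return predicted.strip().lower() == expected.strip().lower()

def evaluate(path: str) -> float:
    with open(path) as f:
        problems = [json.loads(line) for line in f]  # {"question": ..., "answer": ...}
    correct = sum(grade(ask_model(p["question"]), p["answer"]) for p in problems)
    return correct / len(problems)

# Example (assumes a local JSONL file of problems):
# print(f"accuracy: {evaluate('hle_subset.jsonl'):.1%}")
```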
So let's talk about what's behind the scenes of Grok 4. As Jimmy mentioned, we're throwing a lot of compute into this training. When it started, Grok 4 was also only getting a single-digit number on this benchmark, but as you put in more and more training compute, it gradually became smarter and smarter and eventually solved a quarter of the HLE problems, and that's without any tools.

The next thing we did was add tool capabilities to the model. Grok 3 is actually able to use tools as well, but here we make it more native, in the sense that we put the tools into the training. Grok 3 relied only on generalization; here we actually put the tools into the training, and it turns out this significantly improves the model's ability to use them.

Yeah, I remember we had DeepSearch back in the day. How is this different? Exactly: DeepSearch was the Grok 3 reasoning model without any tool-specific training; we simply asked it to use those tools. Compared to this, it was much weaker in its tool capabilities, and unreliable.
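As a rough picture of what "putting tools into training" feeds on at inference time, here is a minimal agent loop: the model either emits a tool call or a final answer, the harness executes the tool, and the result is appended to the context for the next step. The message format, the tool names, and the `call_model` stub are illustrative assumptions, not xAI's actual interface.

```python
import json

def web_search(query: str) -> str:
    return f"(stub) top results for: {query}"      # replace with a real search client

def run_python(code: str) -> str:
    return "(stub) execution disabled in this sketch"

TOOLS = {"web_search": web_search, "run_python": run_python}

def call_model(messages: list) -> dict:
    """Stub for the model; a tool-trained model decides when to call tools."""
    return {"type": "final", "content": "(stub answer)"}

def agent(question: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["type"] == "final":               # model is done reasoning
            return reply["content"]
        # Otherwise expect e.g. {"type": "tool", "tool": "web_search", "args": {...}}
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return "step limit reached"

print(agent("Who curated Humanity's Last Exam?"))
```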
And to be clear, this is still fairly primitive tool use. Compare it to the tools I've used at Tesla or SpaceX, where you're running finite element analysis and computational fluid dynamics, or Tesla's crash simulations, where the simulations are so close to reality that if the test doesn't match the simulation you assume the test article is wrong; that's how good the simulations are. Grok is not currently using any of those really powerful tools that a company would use, but that is something we will provide it with later this year. It will have the tools that a company has, and a very accurate physics simulator.

Ultimately, the thing that will make the biggest difference is being able to interact with the real world via humanoid robots. You combine Grok with Optimus, and it can actually interact with the real world, formulate a hypothesis, and then confirm whether that hypothesis is true or not.

So think about where we are today: we're at the beginning of an immense intelligence explosion. We're in the intelligence big bang right now, and we're at the most interesting time to be alive of any time in history.
Now, that said, we need to make sure that the AI is a good AI. Good Grok. The thing I think is most important for AI safety, at least my biological neural net tells me so, is for the AI to be maximally truth-seeking. That's very fundamental. You can think of AI as a super-genius child that will ultimately outsmart you, but you can still instill the right values and encourage it to be truthful, honorable, the good things, the values you'd want to instill in a child that will ultimately grow up to be incredibly powerful.

So yes, these are still primitive tools, not the kind of tools serious commercial companies use, but we will provide it with those tools, and I think it will be able to solve real-world technology problems with them. In fact, I'm certain of it; it's just a question of how long it
takes.

So is it just "compute is all you need" at this point, Tony? Well, you need compute plus the right tools, and then ultimately the ability to interact with the physical world. And then we'll effectively have an economy that is ultimately thousands of times bigger than our current economy, or maybe millions of times. If you think of civilization as percentage completion of the Kardashev scale, where Kardashev I is using all the energy output of a planet, Kardashev II is using all the energy output of a star, and III is all the energy output of a galaxy, then in my opinion we're probably closer to 1% of Kardashev I than we are to 10%, so maybe one or two percent of Kardashev I. We will get most of the way, like 80 to 90% of Kardashev I, and then, hopefully, if civilization doesn't annihilate itself, on to Kardashev II. The actual notion of a human economy, assuming civilization continues to progress, will seem very quaint in retrospect. It will seem like cavemen throwing sticks into a fire compared to what the future will hold.
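For a rough sense of where such percentages come from, here is a back-of-the-envelope calculation. The input figures (about 2e13 W of current world power use, about 1.7e17 W of sunlight intercepted by Earth as one common Type I proxy) and the Sagan interpolation formula are illustrative assumptions, and the "percent of Type I" reading varies enormously depending on which planetary energy budget you count, which is why loose spoken estimates differ.

```python
import math

P_HUMANITY = 2.0e13      # W, rough current world power consumption (assumption)
P_TYPE_I   = 1.7e17      # W, sunlight intercepted by Earth (one Type I proxy)

fraction = P_HUMANITY / P_TYPE_I
kardashev_index = (math.log10(P_HUMANITY) - 6) / 10   # Sagan's interpolation formula

print(f"share of Type I energy budget: {fraction:.4%}")   # roughly 0.01%
print(f"Sagan Kardashev index K:       {kardashev_index:.2f}")  # roughly 0.7
```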
It's very exciting. I've at times been somewhat worried, because it is a bit unnerving to have intelligence created that is far greater than our own, and will this be bad or good for humanity? I think it'll be good. Most likely it'll be good. But I've somewhat reconciled myself to the fact that even if it weren't going to be good, I'd at least like to be alive to see it happen.
Yeah, one technical problem we still need to solve, besides compute, is how to unblock the data bottleneck. When we scaled up the RL here, we invented a lot of new techniques and innovations to find enough challenging RL problems to work on. It's not just that the problem itself needs to be challenging; you also need a reliable signal to tell the model "you did it right" or "you did it wrong." That's the basic principle of reinforcement learning, and as the models get smarter and smarter, the number of genuinely challenging problems gets smaller and smaller. So that's going to be a new kind of challenge we need to surpass, beyond just compute.
And we actually are running out of test questions to ask. Even questions that are ridiculously hard, if not essentially impossible, for humans, written-down questions, are swiftly becoming trivial for AI. But the one thing that is an excellent judge of things is reality. Physics is the law; ultimately everything else is a recommendation. You can't break physics. So the ultimate reasoning test for an AI is reality. You invent a new technology, say you improve the design of a car or a rocket, or create a new medication: does it work? Does the rocket get to orbit? Does the car drive? Does the medicine work? Whatever the case may be, reality is the ultimate judge. So it's going to be reinforcement learning with the loop closed around reality.
We asked the question: how do we go even further? With a single agent we're able to solve about 40% of the problems. What if we have multiple agents running at the same time? This is what's called test-time compute, and as we scale up the test-time compute we're able to solve more than 50% of the text-only subset of the HLE problems. It's a remarkable achievement, I think.

Yeah, this is insanely difficult. What we're saying is that Grok 4 can solve a majority of the text-based portion of the scarily named Humanity's Last Exam, and you can try it out for yourself. What Grok 4 Heavy does is spawn multiple agents in parallel. All of those agents work independently, and then they compare their work and decide which answer is best, like a study group. It's not as simple as a majority vote, because often only one of the agents figures out the trick or the solution, but once an agent figures out the real nature of the problem, it shares that solution with the other agents; they essentially compare notes and then yield an answer.

So that's the Heavy part of Grok 4: we scale up the test-time compute by roughly an order of magnitude, have multiple agents tackle the task, and then they compare their work and put forward what they think is the best result.
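A minimal sketch of that "study group" pattern, with stubbed model calls rather than the real Grok 4 Heavy orchestration: several agents attempt the task independently, their candidate answers and reasoning notes are pooled, each agent revises with the pooled notes, and a selection step picks the final answer. The `solve` and `judge` stubs and the two-round structure are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def solve(task: str, notes: list, seed: int) -> dict:
    """Stub for one agent attempt; a real system would call the model here."""
    return {"answer": f"candidate-{seed}", "note": f"approach-{seed} given {len(notes)} shared notes"}

def judge(task: str, candidates: list) -> dict:
    """Stub for the selection step; could itself be a model call."""
    return max(candidates, key=lambda c: len(c["note"]))   # placeholder criterion

def heavy_solve(task: str, n_agents: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        # Round 1: independent attempts.
        first = list(pool.map(lambda i: solve(task, [], i), range(n_agents)))
        notes = [c["note"] for c in first]
        # Round 2: each agent revises after seeing the pooled notes
        # (this is where one agent's "trick" can propagate to the rest).
        revised = list(pool.map(lambda i: solve(task, notes, i), range(n_agents)))
    return judge(task, first + revised)["answer"]

print(heavy_solve("an HLE-style question"))
```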
So we're introducing Grok 4 and Grok 4 Heavy. Basically, Grok 4 is the single-agent version and Grok 4 Heavy is the multi-agent version. Let's look at how they actually do on those exam problems, and also on some real-life problems.

We're going to start by looking at one of the HLE problems. This is actually one of the easier math ones. I don't really understand it very well, I'm not that smart, but I can launch this job here and we can watch it start to think about the problem. While we're doing that, I also want to show a little more of what this model can do and launch a Grok 4 Heavy task as well. Everyone knows Polymarket: it's extremely interesting, a seeker of truth that aligns with reality most of the time. With Grok, what we're looking at is whether we can take these markets and try to predict the future as well. So as we let this run, we'll see how Grok 4 Heavy goes about predicting the World Series odds for the current MLB teams. And while we're waiting for these to process, we'll pass it over to Eric to show you one of his examples.
Yeah, one of the coolest things about Grok 4 is its ability to understand the world and solve hard problems by leveraging tools, like Tony discussed. One cool example: we asked it to generate a visualization of two black holes colliding. Of course, it took some liberties, and in my case it's actually pretty clear in its thinking trace about what those liberties are. For example, for the effect to be visible at all, you need to really exaggerate the scale of the waves. So here it is in action: it exaggerates the scale in multiple ways, and the amplitude drops off a bit less over distance than it should, but you can see that the basic effects are actually correct. It starts with the inspiral, then the merger, and then you have the ringdown, and this is largely correct, modulo some simplifications it has to make. It's actually quite explicit about them: it uses post-Newtonian approximations instead of computing the full general-relativistic effects near the centers of the black holes, which is, strictly speaking, incorrect and will lead to some inaccuracies, but the overall visualization is basically there.

You can also look at the kinds of resources it references. It obviously uses search and gathers results from a bunch of links, but it also reads through an undergraduate text on analytic gravitational-wave models. It reasons quite a bit about the actual constants it should use for a realistic simulation, and it references existing real-world data. So it's a pretty good model. Going forward, we can plug in the same models that physicists use, so it can run the same level of compute that leading physics researchers are using and give you a physics-accurate black hole simulation. Right now, this is just running in your browser.
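For a flavor of the kind of simplified physics such a visualization leans on, here is a leading-order (quasi-Newtonian) inspiral "chirp" for two equal-mass black holes: the gravitational-wave frequency grows according to the standard chirp-mass evolution, and the strain amplitude scales as f^(2/3). The masses, starting frequency, and cutoff are arbitrary illustrative choices, and this ignores merger and ringdown entirely.

```python
import numpy as np

# Leading-order gravitational-wave chirp from a compact binary (illustrative).
# df/dt = (96/5) * pi^(8/3) * (G*Mc/c^3)^(5/3) * f^(11/3);  strain ~ f^(2/3).
G, c, M_SUN = 6.674e-11, 2.998e8, 1.989e30

m1 = m2 = 30 * M_SUN                          # two 30-solar-mass black holes (assumed)
Mc = (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2      # chirp mass
k = (96 / 5) * np.pi ** (8 / 3) * (G * Mc / c ** 3) ** (5 / 3)

dt, f, t = 1e-4, 20.0, 0.0                    # start the sweep at 20 Hz
ts, fs = [], []
while f < 300.0:                              # stop well before merger
    ts.append(t); fs.append(f)
    f += k * f ** (11 / 3) * dt               # frequency evolution (Euler step)
    t += dt

fs = np.array(fs)
phase = 2 * np.pi * np.cumsum(fs) * dt        # integrate frequency to get phase
strain = fs ** (2 / 3) * np.cos(phase)        # un-normalised strain waveform
print(f"chirp lasts {ts[-1]:.2f} s, sweeping 20 -> {fs[-1]:.0f} Hz")
```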
OK, swapping back real quick, we can take a look: the math problem has finished. Let's look at its thinking trace, and you can see how it went through the problem. I'll be honest with you guys, I don't fully understand the math, but I did look at the answer ahead of time, and it came to the correct answer in the final part here.

We can also check on our World Series prediction. It's still thinking that one through, but we can try some other things in the meantime, like some of the X integrations we built. We worked very heavily on our X tools and on building a really great X experience. So we can ask the model, say, "find me the xAI employee who has the weirdest profile photo." That goes off and starts. Then we can also try, "create a timeline based on X posts detailing the changes in the HLE scores over time," and we can see all the conversation that was taking place at those times as well, who was announcing scores and what the reactions were. We'll let that process.

Going back, this was the Greg Yang photo. If we scroll through here, there's Greg Yang, of course, with the favorite photograph he keeps on his account. That's actually not what he looks like in real life, by the way, just so you're aware, but it is quite funny. The point is that it had to understand the question, and that's the wild part: it understands what a weird photo is, what makes a photo more or less weird. It has to find all the team members, it has to figure out who we all are, and it does this without access to any internal xAI personnel data; it's literally just looking at the internet. So you could run the same query for the weirdest profile photo at any company.

We can also take a look at the Humanity's Last Exam question. It's still researching all the historical scores and will have a final answer soon, but while it's finishing up we can look at one of the ones we set up a second ago. You can see it finds the date when Dan Hendrycks initially announced the benchmark, then OpenAI announcing their score back in February, then progress with Gemini, and Kimi, and you can even see the leaked benchmark numbers people are posting, which, if they're right, are going to be pretty impressive. So, pretty cool. I'm looking forward to seeing how everybody uses these tools and gets the most value out of them.

Yeah, and we're going to close the loop around usefulness as well, so it's not just book smart but actually practically smart. All right.
And we can go back to the slides here. So we also evaluate on the multimodal subset. On the full set, this is the number on the HLE exam, and you can see there's a little dip in the numbers. That's something we're improving: the multimodal understanding capabilities. I believe that in a very short time we'll get much higher numbers on this benchmark.

Yeah, the biggest weakness of Grok currently is that it's sort of partially blind. Its image understanding, and its image generation, need to be a lot better, and that's actually being trained right now. Grok 4 is based on version 6 of our foundation model; we're training version 7, which we'll complete in a few weeks, and that will address the weakness on the vision side.

And just to show off this last one: the prediction market task finished with Grok 4 Heavy, and we can see all the tools and the process it used to find the right answer. It browsed a lot of odds sites, it calculated its own odds and compared them to the market's to find its own alpha and edge, and it walks you through the entire process. It calculates the most likely winner to be the Dodgers, giving them a 21.6% chance of winning this year. And it took approximately four and a half minutes to compute. That's a lot of thinking.
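The "alpha and edge" step can be pictured as nothing more than comparing the model's estimated probability with the probability implied by the market price. The market price below is made up for illustration; only the 21.6% model estimate comes from the demo.

```python
# Comparing a model's probability estimate to a prediction-market price.
def edge(model_prob: float, market_price: float) -> dict:
    """market_price: cost of a contract that pays $1 if the event happens."""
    implied_prob = market_price                    # ignoring fees and spread
    return {
        "implied_prob": implied_prob,
        "edge": model_prob - implied_prob,         # positive => model thinks it's underpriced
        "expected_value": model_prob * 1.0 - market_price,   # per $1 contract
    }

# Hypothetical numbers: model says 21.6%, market contract trades at $0.18.
print(edge(model_prob=0.216, market_price=0.18))
```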
We can also look at all the other benchmarks besides HLE. As it turned out, Grok 4 excelled on all the reasoning benchmarks people usually test on, including GPQA, a PhD-level problem set that is easier than HLE. On AIME 25, the American Invitational Mathematics Examination, Grok 4 Heavy got a perfect score. We also did very well on the coding benchmark LiveCodeBench, on HMMT, the Harvard-MIT Mathematics Tournament, and on USAMO. You can see that on all of those benchmarks we often have a very large leap over the second-best model out there.

Yeah, really we're going to get to the point where it gets every answer right on every exam, and where it doesn't get an answer right, it will tell you what's wrong with the question, or, if the question is ambiguous, disambiguate it into answers A, B, and C and tell you what each of those answers would be for the disambiguated question. So the only real test left will be reality: can it make useful technologies, discover new science? That will be the only thing left, because human tests will simply not be meaningful.
Yeah, we'll need to update HLE very soon given the current rate of progress. So it's super cool to see multiple agents collaborating to solve really challenging problems. You can try this model; it's available right now. If we advance to the next slide, there's a SuperGrok Heavy tier we're introducing that gives you access to both Grok 4 and Grok 4 Heavy, where you become the taskmaster launching little Grok research agents that help you get smarter, handle all the little research, and save you hours of mundane tasks. It's available right now.

We did limit usage during the demo so the demo didn't break, because all of this is happening live; there's nothing canned about any of the tests we're running. After the demo is done, we'll enable more SuperGrok subscribers, so if you can't subscribe right now, just try again in half an hour and it should work. And now let's get into voice.
Many of you have been enjoying our voice mode, and we've been working hard to improve the experience over the past couple of months. We've cut latency in half to make it much snappier, and today we're excited to announce a set of new voices with exceptional naturalness and prosody. You might have noticed the movie-trailer voice that opened the livestream; that's one of our new voices, S, who's got that epically deep tone. And we're also excited to introduce Eve, a beautiful British voice capable of rich emotion. Aman, would you like to introduce Eve?

Absolutely, let's get into the demo. "Hey, Eve. Millions of people are watching us right now. Are you excited?" "Oh, I'm thrilled. Really, it's like being on stage at the Old Vic with a sea of faces out there. Just imagine all those eyes and ears tuned in. It's a proper buzz, isn't it? What do you reckon we should chat about to keep this lot entertained?" "You know, we have a lot to talk about, but I'll be honest, I'm feeling a little nervous. Can you whisper something soothing to calm me down?" "Take a deep breath, love. You've got this. It's just you and me having a quiet chat, like we're tucked away in a cozy corner of a Yorkshire pub. The world's just a murmur out there. Feel that calm wash over you. Let's keep talking, nice and easy. What's on your mind?" "You know what, I'm actually feeling quite relaxed, so let's try something crazy. Can you give us an opera about Diet Coke?" "Oh, an opera about Diet Coke, that's a riot. Picture this: a shimmering silver can glistening under stage lights. Oh Diet Coke, thou elixir divine, with bubbles that dance in a sparkling line. Thy crisp cool kiss on lips so fine. Oh, aspartame, forever mine. How's that for a mad little aria? Want me to belt out another verse or switch up the tune?" "Okay, let's try switching it up, and just start singing." "Right, then let's have a go. Oh, silver can, you gleam in the light, Diet Coke, my heart's delight." "Okay, very poetic. Thank you, Eve. Feeling quite nice now."

Cool, so that's Eve. Something else we do here is a little test each day to check the natural flow of our conversations against some of our competitors. So we figured, why not try that live on the stream today? Should I start?
"Hey, I'm going to say a number, and can you say the same number back to me? One." "One. Got it. This is like a fun little game of echo. What's next?" "Two." "Sorry, you said two." "Three." "Three." "Four." "Nice and steady. What's next?" "Five." "Five. Nailed it. This is pretty fun, actually. Feel free to keep going or switch it up if you'd like."

"Hey Grok." "Hello there. How can I help you today?" "We're going to do a little test. Can you repeat after me? One." "One. All right. What's next?" "Two." "Two. What's on your mind?" "Three." "Three. Need anything else?" "Four." "Four. How can I assist you?" "Five." "Five. What's next?"

So, as you can see, Grok was snappier and it didn't interrupt me. And on prosody, we've made different design choices: we're shooting for something more calm, smooth, and natural, versus something more poppy or artificial. We'll keep improving on these fronts. All right, thanks, guys.
Since the launch of voice mode, we've seen 2x faster end-to-end latency, five different voices, and 10x the active users, so Grok voice is taking off.

Now, about releasing the models: this time we're also releasing Grok 4 through the API at the same time, so if we go to the next two slides, we're very excited about what all the developers out there are going to build. If I think about myself as a developer, the first thing I'd do with access to the Grok 4 API is benchmarks. We asked around on the X platform: what is the most challenging benchmark out there, the one considered the holy grail for AGI models? It turns out AGI is in the name: ARC-AGI. So over the last 12 hours, kudos to Greg here in the audience, who answered our call, took a preview of the Grok 4 API, and independently verified Grok 4's performance. Initially we just thought, hey, Grok 4 is pretty good, pretty smart, our next-gen reasoning model that spends 10x more compute and can use all the tools. But when it was verified on the private subset of ARC-AGI v2, it turned out to be the only model in the last three months to break the 10% barrier, and in fact it was so good it reached 15.8% accuracy, twice the second-place model, which is Claude 4 Opus.

And it's not just about raw performance. When the API model is driving your automation, what matters is also intelligence per dollar, and if you look at the plots here, Grok 4 is in a league of its own.

All right, enough benchmarks. What can Grok actually do in the real world?
37:43
actually uh you know contacted the folks from uh uh Endon Labs uh who you know
37:49
were you know gracious enough to you know try to gro in the real world to run a business. Yeah thanks for having us. So I'm Axel
37:56
from Labs and I'm Lucas and we tested Gro 4 on vending bench. Vending Bench is an AI
38:02
simulation of a business scenario uh where we thought what is the most simple
38:07
business an AI could possibly run and we thought vending machines. Uh so in this scenario the the Grock and other models
38:14
needed to do stuff like uh manage inventory, contract contact suppliers,
38:20
set prices. All of these things are super easy and all of they like all the models can do them one by one. But when
38:26
you do them over very long horizons, most models struggle. Uh but we have a leaderboard and there's a new number
38:32
one. Yeah. So we got early access to the GR 4 API. Uh we ran it on the bending bench
38:38
and we saw some really impressive results. So it ranks definitely at the number one spot. It's even double the
38:45
net worth which is the measure that we have on this value. So it's not about a percentage on a uh or a score you get,
38:51
but it's more the dollar value in net worth that you generate. So we were impressed by Grock. It was able to
38:57
formulate a strategy and adhere to that strategy over long period of time much
39:02
longer than other models that we have tested other frontier models. So it managed to run the uh simulation for
39:08
double the time and score yeah double the net worth and it was also really consistent uh across these runs which is
39:14
something that's really important when you want to use this in the real world. And I think as we give more and more
39:20
power to AI systems in the real world, it's important that we test them in scenarios that either mimic the real
39:26
world or are in the real world itself. Um because otherwise we we fly blind into some uh some things that uh that
39:34
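As a cartoon of what a Vending-Bench-style evaluation loops over (a guess at the general shape, not Andon Labs' actual harness): the agent repeatedly observes the business state, picks an action such as restocking or setting a price, the simulator applies it, and the score is final net worth rather than per-step accuracy, which is what makes long horizons hard.

```python
import random
from dataclasses import dataclass

# A toy long-horizon business simulation (illustrative, not Vending-Bench itself).
@dataclass
class State:
    cash: float = 500.0
    stock: int = 0
    price: float = 2.0

def agent_decide(state: State) -> dict:
    """Stub policy; a real evaluation would put an LLM agent here."""
    if state.stock < 20 and state.cash > 50:
        return {"action": "restock", "units": 40}
    return {"action": "set_price", "price": 2.5}

def step(state: State, decision: dict) -> State:
    if decision["action"] == "restock":
        units = min(decision["units"], int(state.cash // 1.0))
        state.cash -= units * 1.0                       # unit cost of $1 (assumed)
        state.stock += units
    elif decision["action"] == "set_price":
        state.price = decision["price"]
    demand = max(0, int(random.gauss(30, 8) * (2.0 / state.price)))
    sold = min(state.stock, demand)
    state.stock -= sold
    state.cash += sold * state.price
    return state

state = State()
for day in range(365):                                  # the long horizon is the hard part
    state = step(state, agent_decide(state))

net_worth = state.cash + state.stock * 1.0              # value remaining stock at cost
print(f"net worth after one simulated year: ${net_worth:,.2f}")
```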
might not be great.

Yeah, it's great to see that we've now got a way to pay for all those GPUs. We just need a million vending machines; we could make $4.7 billion a year with a million vending machines. 100%, let's go. They're going to be epic vending machines. We are actually going to install vending machines here, a lot of them. We're happy to supply them. I'm looking forward to seeing what amazing things are in this vending machine. That's for you to decide. All right, tell the AI. Sounds good.

So we can see that Grok is able to become the co-pilot of a business unit. What else can Grok do? We're releasing this Grok right now, so you can try it and run the same benchmarks we did: it's available on the API, with a 256k context length.
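A minimal sketch of calling the model over the API. This assumes an OpenAI-style chat-completions interface at xAI's endpoint and a `grok-4` model identifier; check the official xAI API documentation for the actual base URL, model name, and request fields before relying on it.

```python
import os
import requests

# Hedged example: assumes an OpenAI-compatible chat endpoint; verify against
# the xAI docs (base URL, model name, and fields may differ).
API_KEY = os.environ["XAI_API_KEY"]

resp = requests.post(
    "https://api.x.ai/v1/chat/completions",          # assumed endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "grok-4",                            # assumed model identifier
        "messages": [
            {"role": "user", "content": "Summarize the Grok 4 launch in two sentences."}
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```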
We're already seeing some early adopters try the Grok 4 API. Our Palo Alto neighbor, the Arc Institute, a leading biomedical research center, is already exploring how to automate its research workflows with Grok 4. It turns out Grok 4 can help scientists sift through millions of experiment logs and pick out the best hypotheses within seconds, and we're seeing it used in their CRISPR research. Grok 4 was also independently evaluated as the best model for examining chest X-rays, who would have known. And in the financial sector, Grok, with access to all the tools and real-time information, is already one of the most popular AIs out there. Grok is also going to be available on the hyperscalers. The xAI enterprise business only started two months ago, and we're open for business.
The other thing: we've talked a lot about having Grok make video games. Danny is a video game designer on X, and we asked, who wants to try a preview of the Grok 4 API to make games? Danny answered the call, and this first-person shooter was made in a span of about four hours. One of the underappreciated hard problems of making video games is not necessarily coding the core logic, but going out and sourcing all the assets, all the texture files, to create a visually appealing game. One of the things Grok 4 does really well, with all the tools at its disposal, is automating that asset sourcing, so developers can focus on the core development itself. So now you can run an entire game studio with one person, and have Grok 4 go out, source all those assets, and handle all the maintenance tasks for you.
Yeah, the next step, obviously, is for Grok to be able to play the games. For that it has to have very good video understanding, so it can play the games, interact with them, and actually assess whether a game is fun, and have good judgment about that. With version 7 of our foundation model, which finishes training this month and then goes through post-training, RL, and so on, it will have excellent video understanding. With video understanding and improved tool use, for video games you'd want to use Unreal Engine or Unity or one of the main graphics engines, then generate the art, apply it to a 3D model, and create an executable that someone can run on a PC, a console, or a phone. We expect that to happen probably this year, and if not this year, certainly next year. It's going to be wild. I'd expect the first really good AI-made video game next year, probably the first half hour of watchable AI-generated TV this year, and probably the first watchable AI movie next year. Things are really moving at an incredible pace.

Yeah, while Grok is 10x-ing the world economy with vending machines, it'll just create video games for humans. I mean, it went from not being able to do any of this even six months ago to what you're seeing before you here, and from very primitive a year ago to making a sort of 3D video game with a few hours of prompting.
Just to recap: in today's livestream we introduced the most powerful, most intelligent AI model out there, one that can actually reason from first principles, use all the tools, do all the research, go on a ten-minute journey, and come back with the most correct answer for you. It's kind of crazy to think that just four months ago we had Grok 3, and now we already have Grok 4, and we're going to continue to accelerate. As a company, xAI is going to be the fastest-moving AGI company out there.

What's coming next is that we'll keep developing models that aren't just intelligent and able to think for a really long time with a lot of compute, but that are both fast and smart; that's going to be the core focus. If you think about which applications can really benefit from very intelligent, fast models, coding is one of them.

Yeah, the team is currently working very heavily on coding models. The main focus right now is a specialized coding model we recently trained that is going to be both fast and smart, and I believe we can share that model with all of you in a few weeks.
That's very exciting. And the second thing after coding: we all see that the weakness of Grok 4 is its multimodal capability. In fact, it was so limited that Grok is effectively looking at the world squinting through glass, seeing blurry features and trying to make sense of them. The most immediate improvement with the next-generation pre-trained model is a step-function improvement in image understanding, video understanding, and audio. The model will be able to hear and see the world just like any of you, and with all the tools at its command and all the other agents it can talk to, we're going to see a huge unlock across many different application layers.

After the multimodal agents, what comes next is video generation, and we believe that at the end of the day it should just be pixels in, pixels out. Imagine a world with an infinite scroll of content inventory on the X platform, where you can not only watch these generated videos but intervene and create your own adventures. We expect to train our video model with over 100,000 GB200s, and to begin that training within the next three or four weeks. So we're confident it's going to be pretty spectacular in video generation and video understanding.
So, let's see, that's it, unless there's anything you guys want to say. Yeah, it's a good model, sir. Good model. Well, we're very excited for you all to try Grok 4. Thank you. All right, thanks, everyone. Thank you. Good night.