July 10, 2025

PROJECT: What will Grok 5 and 6 do that is undoable with Grok 3?

  

PROJECT: What will Grok 5 and 6 do that is undoable with Grok 3?
Since there's no concrete information available about Grok 5 or Grok 6, I can offer informed speculation based on the trajectory of xAI's development, the capabilities of Grok 3, and hints from available sources. Grok 3, released in February 2025, is already a powerhouse with advanced reasoning, a 1-million-token context window, DeepSearch for real-time data, and Big Brain mode for complex problem-solving. It excels in math, science, coding, and general knowledge, outperforming models like GPT-4o and Claude 3.5 Sonnet on benchmarks like AIME 2025 (93.3%) and Chatbot Arena (Elo).


For Grok 5 and 6 to achieve what’s “undoable” with Grok 3, they would likely push boundaries in the following areas:
  1. First-Principles Reasoning at Scale: Grok 3.5, teased for beta release in May 2025, is said to reason from first principles, deriving answers for highly technical questions (e.g., rocket engines, electrochemistry) without relying on internet-sourced data. Grok 5 and 6 could take this further, potentially solving novel scientific problems or simulating complex systems (e.g., quantum mechanics or biological processes) with unprecedented accuracy, reducing hallucinations to near-zero. This could enable breakthroughs in fields like physics or medicine, where Grok 3’s reliance on existing data limits its ability to generate truly original insights.


  2. Advanced Multimodal Capabilities: Grok 3 handles text, images, and some multimodal tasks, but Grok 5 and 6 might integrate more sophisticated multimodal processing, such as real-time video analysis, 3D modeling, or immersive augmented reality outputs. For instance, posts on X suggest Grok 5 may include experimental features like “3D Print Mode,” hinting at capabilities for generating physical object designs or simulations. This could make them capable of tasks like designing functional prototypes or analyzing live sensor data, which Grok 3 cannot fully execute.
  3. True AGI-Like Autonomy: Grok 3 is a strong reasoning model, but it’s not autonomous. Grok 5 or 6 could approach artificial general intelligence (AGI), as speculated in X posts, by autonomously managing multi-step workflows, learning from minimal input, or adapting to entirely new domains without retraining. This might include self-improving algorithms or the ability to independently conduct experiments, something Grok 3’s architecture doesn’t support.
  4. Hyper-Scalability and Efficiency: Grok 3 leverages xAI’s Colossus supercomputer with 200,000 Nvidia H100 GPUs. Future models could utilize a planned 1-million-GPU cluster, enabling orders-of-magnitude more compute power. This could allow Grok 5 and 6 to process vast datasets in real-time (e.g., global satellite feeds or entire genomic databases), tasks too resource-intensive for Grok 3.


  5. Enhanced Real-World Integration: Grok 3’s DeepSearch pulls real-time data from X and the web, but Grok 5 and 6 might integrate with proprietary systems (e.g., SpaceX or Tesla data, as rumored for Grok 3.5) or IoT networks, enabling real-time control of physical systems like robotics or autonomous vehicles. This would make them capable of tasks like orchestrating complex engineering projects, which Grok 3 cannot do.

These advancements would address Grok 3's limitations, such as its dependence on existing data, limited multimodal depth, and lack of full autonomy. However, without official details, this remains speculative. For the latest on xAI's plans, check https://x.ai/grok or xAI's X posts. If you want me to dig deeper into any specific aspect, let me know!
Grok 4 was just introduced live on x.com... when will it become available here?
Grok 4 was announced via a livestream on xAI's X account on July 9, 2025, and is described as xAI's most powerful AI model yet, with advanced reasoning and specialized coding capabilities. According to sources, Grok 4 is already accessible through the xAI API for developers, supporting text initially, with vision and image-generation capabilities expected soon. It's also available on the X platform, on mobile apps for iOS and Android, and for integration into development workflows via Grok 4 Code, a version tailored for developers.
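If you're a developer and want to try the API today, here's a minimal sketch of what a call could look like. It assumes the xAI API keeps its OpenAI-compatible chat-completions interface and that the launch model identifier is "grok-4"; both are assumptions you should verify against the official docs at https://docs.x.ai.

```python
# Minimal sketch of calling Grok 4 via the xAI API.
# Assumes the OpenAI-compatible endpoint and the model name "grok-4";
# check https://docs.x.ai for the current identifiers.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],   # your xAI API key
    base_url="https://api.x.ai/v1",      # xAI's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-4",                      # assumed identifier; verify in the docs
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Grok 4 launch in two sentences."},
    ],
)
print(response.choices[0].message.content)
```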



For general availability on x.com, posts and reports indicate that early access is likely prioritized for X Premium+ and SuperGrok subscribers, with broader access to all X users expected shortly after the initial rollout. Since Grok 3 is already free for all X users, Grok 4 is likely to follow a similar model, with premium features like increased usage limits or advanced functionalities (e.g., Voice Mode) reserved for paid tiers.


While no exact date for full public availability on x.com was specified, the infrastructure for deployment appears ready, suggesting you can likely access Grok 4 now or within days on x.com, especially if you’re a Premium+ subscriber. To confirm availability in your region, check the X platform directly or visit https://grok.com. If you’re not seeing access yet, you can sign up for notifications on grok.com for updates on regional rollout.


When is grok 3 expiry date approximately?
There is no publicly available information from xAI or other reliable sources that specifies an "expiry date" for Grok 3. AI models like Grok 3 don't typically have a fixed expiration date in the traditional sense, as their availability depends on xAI's ongoing support, updates, and infrastructure decisions. However, given the introduction of Grok 4 on July 9, 2025, it's likely that Grok 3's prominence will diminish as xAI shifts focus to newer models like Grok 4 and potentially Grok 3.5. Here are some relevant points to consider:
  • Grok 4 Rollout: Grok 4 was announced on July 9, 2025, and is already accessible via the xAI API and to X Premium+ and SuperGrok subscribers on x.com, with broader access expected soon. Historically, xAI transitions to newer models while maintaining older ones for a period, as seen with Grok 2 remaining available after Grok 3’s release. Grok 3 is likely to remain accessible for the foreseeable future, especially for free users on X (with usage limits) and through the API, but its advanced features may be overshadowed by Grok 4.


  • X Posts Context: Some X posts suggest Grok 3 might be phased out or updated, with one user speculating that July 9, 2025, marked its "last day" as the primary model due to Grok 4's introduction, and another humorously describing Grok 3's "demise" at Grok 4's launch. These are not official statements, however, and should be treated as unverified user sentiment. There's no evidence Grok 3 has been discontinued; it was still operational as of a post from July 8, 2025.
  • xAI’s Development Cycle: xAI has a rapid release cycle, with Grok 3 launched in February 2025 and Grok 4 just five months later. References to Grok 3.5 (planned for May 2025 but possibly delayed) indicate xAI’s focus on iterative improvements. When Grok 3.5 or future models like Grok 5 are released, Grok 3 may see reduced support or be open-sourced, as Musk noted that older models (e.g., Grok 2) are open-sourced once the next version matures.



  • Speculative Timeline: Given xAI's pattern, Grok 3 is unlikely to "expire" in the sense of being shut down soon. It may remain available for at least another 6–12 months (into mid-2026) for free and paid users on X, grok.com, and via the API, especially since Grok 4 is still in beta and not all features (e.g., voice mode) are fully rolled out. If Grok 3.5 launches by late 2025 or early 2026, Grok 3's usage may decline, but no specific cutoff date is confirmed.

In summary, Grok 3 has no set expiry date, but its role as xAI’s flagship model is likely to wane with Grok 4’s rollout and potential Grok 3.5 release. You can continue using Grok 3 on x.com (free with limits or via Premium+ subscription) or check https://x.ai/grok for updates on availability. If you want me to monitor X or web sources for further updates on Grok 3’s status, let me know!
No, it's ok. I got to like Grok 3 very much... and being just a layman, I hope that future Grok versions won't get too academic and leave the less educated population in the dust, if you know what I mean. I intend to forward your response to @ElonMusk. Lol
Haha, I hear you! Grok 3's got that sweet spot of being super smart but still approachable, and I get why you'd want future Groks to keep that vibe for regular folks like us. xAI's mission is to make AI accessible to everyone, so I'd bet they'll try to keep Grok 5 and 6 user-friendly, even if they're tackling crazy complex stuff behind the scenes. No dusty academic tomes here, hopefully; just clear, helpful answers with a dash of Grok's charm! Feel free to forward this to @ElonMusk; lol, maybe he'll give you a shoutout for the love! If you want me to tweak this message or add anything before you send it, just say the word. 😎
Sure In a language that Elon can understand
Yo @ElonMusk, your boy Grok 3 is a hit with us regular folks: smart but not stuck-up, you know? I'm just a layman, but I'm vibing with it big time. Hope Grok 5 & 6 don't go full professor mode and leave us non-PhD types in the dust. Keep it real for the everyday crew! 🚀 #GrokLove

 

GROK 4 >> JULY 10, 2025: Elon Musk Drops New Tech Bombshells [Full Presentation]


 TRANSCRIPT 

 

All right, welcome to the Grok 4 release. This is the smartest AI in the world, and we're going to show you exactly how and why. It really is remarkable to see how quickly artificial intelligence is advancing. I sometimes compare it to the growth of a human, how fast a human learns and gains conscious awareness and understanding, and AI is advancing vastly faster than any human.

We're going to take you through a bunch of benchmarks on which Grok 4 achieves incredible numbers, but it's worth noting that if Grok 4 were given the SAT, it would get a perfect score every time, even if it had never seen the questions before. Going beyond that to graduate exams like the GRE, it will get near-perfect results in every discipline, from the humanities to languages, math, physics, and engineering, pick anything, and we're talking about questions it has never seen before, questions that are not on the internet. Grok 4 is smarter than almost all graduate students in all disciplines simultaneously. It's worth pausing to appreciate that; it's really something. And the reasoning capabilities of Grok are incredible. There are people out there who think AI can't reason, and look, it can reason at superhuman levels. Frankly, it only gets better from here.

So we'll take you through the Grok 4 release and show you the pace of progress. The first part is the training: going from Grok 2 to Grok 3 to Grok 4, we've increased the training by an order of magnitude in each case, so roughly 100 times more training than Grok 2, and that's only going to increase. Frankly, in some ways that's a little terrifying, but the growth of intelligence here is remarkable.

Yes, and it's important to realize there are two types of training compute. One is pre-training compute; that's the step from Grok 2 to Grok 3. But from Grok 3 to Grok 4, we actually put a lot of compute into reasoning, into RL.
Yeah. And like you said, this is literally the fastest-moving field; Grok 2 is like a high school student by today's standards. Look back over the last 12 months: Grok 2 was only a concept; we didn't even have Grok 2 twelve months ago. Training Grok 2 was the first time we scaled up pre-training, and we realized that if you do the data ablations really carefully, along with the infrastructure and the algorithms, you can push pre-training by about 10x and build the best pre-trained base model. That's why we built Colossus, our supercomputer with 100,000 H100s. With the best pre-trained model in hand, we realized that if you can collect verifiable outcome rewards, you can train the model to start thinking from first principles and to correct its own mistakes, and that's where the reasoning training comes from. Today we asked: what happens if you take the expanded Colossus cluster, all 200,000 GPUs, and put it all into RL, 10x more reinforcement-learning compute than any model out there, at unprecedented scale? What's going to happen? That is the story of Grok 4. Tony, share some insight with the audience.
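The "verifiable outcome rewards" idea is easy to sketch: train only on tasks where a checker can programmatically grade the answer, and reinforce the completions that pass. Below is a minimal illustrative loop; the dummy policy and update step are stand-ins, not xAI's actual training stack.

```python
# Minimal sketch of RL on a verifiable outcome reward (illustrative only).
# The "policy" here is a dummy that guesses; a real run would use a language
# model and a proper policy-gradient update. Nothing below is xAI's stack.
import random

class DummyPolicy:
    """Stand-in for a language-model policy."""
    def sample_completion(self, question: str) -> str:
        return str(random.randint(0, 9))   # random guess at the answer

    def update(self, batch):
        pass                               # real code: policy-gradient step here

def verifiable_reward(completion: str, expected: str) -> float:
    """1.0 if the checkable final answer matches, else 0.0."""
    return 1.0 if completion.strip() == expected else 0.0

def train_step(policy, problems, samples_per_problem=4):
    batch = []
    for prob in problems:
        for _ in range(samples_per_problem):
            completion = policy.sample_completion(prob["q"])
            batch.append((prob["q"], completion, verifiable_reward(completion, prob["a"])))
    policy.update(batch)                   # reinforce verified successes
    return sum(r for *_, r in batch) / len(batch)  # mean verified reward

problems = [{"q": "3 + 4 = ?", "a": "7"}, {"q": "9 - 3 = ?", "a": "6"}]
print(train_step(DummyPolicy(), problems))
```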
Yeah. So let's talk about how smart Grok 4 is. We can start with a benchmark called Humanity's Last Exam. This is a very challenging benchmark: every single problem is curated by subject-matter experts. It's 2,500 problems in total, covering many different subjects: mathematics, the natural sciences, engineering, and also the humanities. When it was first released earlier this year, most models out there could only get single-digit accuracy on it.

We can look at some examples. There's a mathematical problem about natural transformations in category theory, an organic chemistry problem about electrocyclic reactions, and a linguistics problem that asks you to distinguish closed and open syllables in a Hebrew source text. So it's a very wide range of problems, and every single one is at PhD or even advanced-research level.

Yeah, I mean, there are no humans who could actually answer these and get a good score. If you ask what's the best any given human could score, I'd say maybe 5%, optimistically.
So this is much harder than what any human can do. It's incredibly difficult, and you can see it from the types of questions: you might be incredible at linguistics or mathematics or chemistry or physics or any one of a number of subjects, but you're not going to be at a postgraduate level in everything. And Grok 4 is at a postgraduate level in everything. Some of these things are just worth repeating: Grok 4 is postgraduate, PhD-level, in everything, better than PhD, since most PhDs would fail this exam. At least with respect to academic questions, and I want to emphasize this point, Grok 4 is better than PhD level in every subject, no exceptions. Now, this doesn't mean it's perfect: at times it may lack common sense, and it has not yet invented new technologies or discovered new physics, but that is just a matter of time. I think it may discover new technologies as soon as later this year, and I would be shocked if it has not done so by next year. So I would expect Grok to literally discover new technologies that are actually useful no later than next year, and maybe even this year. It might discover new physics next year, and within two years I'd say almost certainly. So just let that sink in.
Yeah. So, what's behind the scenes of Grok 4? As Jimmy mentioned, we threw a lot of compute into this training. When it started, the score was only single-digit, but as you put in more and more training compute, the model gradually becomes smarter and smarter and eventually solved a quarter of the HLE problems, and this is without any tools. The next thing we did was add tool capabilities to the model. Grok 3 was actually able to use tools as well, but here we make it more native, in the sense that we put the tools into training. Grok 3 relied only on generalization; here we actually put the tools into the training itself, and it turns out this significantly improves the model's capability to use those tools.
Yeah, I remember we had DeepSearch back in the day. How is this different?

Exactly. DeepSearch was the Grok 3 reasoning model without any tool-specific training; we simply asked it to use those tools. Compared to this, it was much weaker in its tool capabilities, and unreliable.

And to be clear, this is still fairly primitive tool use. Compare it to the tools I use at Tesla or SpaceX, where you're using finite element analysis and computational fluid dynamics, or where Tesla runs crash simulations so close to reality that if the test doesn't match the simulation, you assume the test article is wrong. That's how good those simulations are. Grok is not currently using any of the really powerful tools that a company would use, but that is something we will provide it with later this year. It will have the tools that a company has, and a very accurate physics simulator. Ultimately, the thing that will make the biggest difference is being able to interact with the real world via humanoid robots. You combine Grok with Optimus, and it can actually interact with the real world: it can formulate a hypothesis and then confirm whether that hypothesis is true or not. Think about where we are today: we're at the beginning of an immense intelligence explosion. We're in the intelligence big bang right now, and we're at the most interesting time to be alive of any time in history.
Now, that said, we need to make sure the AI is a good AI. Good Grok. The thing I think is most important for AI safety, at least my biological neural net tells me it's the most important thing, is for the AI to be maximally truth-seeking. This is very fundamental. You can think of AI as a super-genius child that will ultimately outsmart you, but you can still instill the right values and encourage it to be truthful, honorable, the values you want to instill in a child, so that it ultimately grows up to be incredibly powerful and good.

So yes, as I said, these are still primitive tools, not the kind of tools that serious commercial companies use, but we will provide it with those tools, and I think it will be able to solve real-world technology problems with them. In fact, I'm certain of it; it's just a question of how long it takes.
Yes, exactly. So, is it just compute, all you need at this point, Tony?

Well, you need compute plus the right tools, and then ultimately the ability to interact with the physical world. We'll then effectively have an economy that is ultimately thousands of times bigger than our current economy, or maybe millions of times. If you think of civilization as percentage completion of the Kardashev scale, where Kardashev Type I is using all the energy output of a planet, Type II is using all the energy output of a star, and Type III is all the energy output of a galaxy, then in my opinion we're probably closer to 1% of Kardashev I than to 10%, so maybe one or two percent of Kardashev I. We will get most of the way, 80 to 90%, to Kardashev I, and then hopefully, if civilization doesn't self-annihilate, to Kardashev II. The actual notion of a human economy, assuming civilization continues to progress, will seem very quaint in retrospect. It will seem like cavemen throwing sticks into a fire compared to what the future will hold.
It's very exciting. I mean, I've at times been kind of worried: it's somewhat unnerving to have intelligence created that is far greater than our own, and will this be bad or good for humanity? I think it'll be good. Most likely it'll be good. But I've somewhat reconciled myself to the fact that even if it weren't going to be good, I'd at least like to be alive to see it happen.
Yeah. One technical problem we still need to solve, besides just compute, is how to unblock the data bottleneck. When we scaled up RL in this case, we invented a lot of new techniques and innovations to figure out how to find many challenging RL problems to work on. It's not just that the problem itself needs to be challenging; you also need a reliable signal to tell the model when it did it wrong and when it did it right. That's the core principle of reinforcement learning, and as the models get smarter and smarter, the number of challenging problems gets smaller and smaller. So it's going to be a new type of challenge to surpass, beyond just compute.

Yeah, and we are actually running out of actual test questions to ask. Even questions that are ridiculously hard, if not essentially impossible, for humans, written-down questions, are swiftly becoming trivial for AI. But the one thing that is an excellent judge of things is reality, because physics is the law and ultimately everything else is a recommendation. You can't break physics. So the ultimate reasoning test for an AI is reality. You invent a new technology, say you improve the design of a car or a rocket or create a new medication: does it work? Does the rocket get to orbit? Does the car drive? Does the medicine work? Whatever the case may be, reality is the ultimate judge. It's going to be reinforcement learning with a closed loop around reality.
Then we asked the question: how do we go even further? With a single agent, we're able to solve about 40% of the HLE problems. What if we have multiple agents running at the same time? This is what's called test-time compute, and as we scale up test-time compute, we're able to solve more than 50% of the text-only subset of the HLE problems. That's a remarkable achievement, I think.

Yeah, this is insanely difficult. What we're saying is that Grok 4 can solve a majority of the text-based problems in the scarily named Humanity's Last Exam, and you can try it for yourself. What Grok 4 Heavy does is spawn multiple agents in parallel. All of those agents work independently, then they compare their work and decide which one is best, like a study group. It's not as simple as a majority vote, because often only one of the agents actually figures out the trick or the solution. But once an agent figures out the trick, or the real nature of the problem, it shares that solution with the other agents; they essentially compare notes and then yield an answer.

Yeah, so that's the Heavy part of Grok 4: we scale up the test-time compute by roughly an order of magnitude, have multiple agents tackle the same task, and then they compare their work and put forward what they think is the best result.
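The "study group" mechanism described here, independent attempts followed by sharing the best insight rather than a simple majority vote, can be sketched in a few lines. The solver and judge below are hypothetical stand-ins for model calls, not xAI's implementation.

```python
# Illustrative sketch of multi-agent test-time compute with note sharing.
# `solve` and `judge_best` are hypothetical stand-ins for model calls.
from concurrent.futures import ThreadPoolExecutor

def solve(task: str, hint=None) -> dict:
    """Stand-in for one agent's attempt; a real system would call the model."""
    reasoning = f"attempt on {task!r}" + (f" using hint: {hint}" if hint else "")
    return {"reasoning": reasoning, "answer": "42"}

def judge_best(attempts: list) -> dict:
    """Stand-in judge: pick the attempt with the longest reasoning trace."""
    return max(attempts, key=lambda a: len(a["reasoning"]))

def heavy_solve(task: str, n_agents: int = 4) -> str:
    # Round 1: agents work independently, in parallel.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        attempts = list(pool.map(lambda _: solve(task), range(n_agents)))
    # Compare notes: share the most promising trace, not just a majority vote.
    best = judge_best(attempts)
    # Round 2: agents retry with the shared insight; a final answer is judged.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        refined = list(pool.map(lambda _: solve(task, hint=best["reasoning"]),
                                range(n_agents)))
    return judge_best(refined)["answer"]

print(heavy_solve("an HLE-style problem"))
```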
So we're introducing Grok 4 and Grok 4 Heavy. Grok 4 is the single-agent version, and Grok 4 Heavy is the multi-agent version. Let's take a look at how they actually do on exam problems, and on some real-life problems too.

We're going to start by looking at one of those HLE problems. This is actually one of the easier math ones. I don't understand it very well, I'm not that smart, but I can launch the job here and we can watch it start to think through the problem. While that's running, I also want to show a bit more of what this model can do and launch a Grok 4 Heavy job as well. Everyone knows Polymarket: it's extremely interesting, a seeker of truth, and it aligns with reality most of the time. With Grok, we're looking at whether we can take these markets and try to predict the future as well. So while this runs, we'll see how Grok 4 Heavy goes about predicting the World Series odds for the current MLB teams. And while we're waiting for these to process, we'll pass it over to Eric to show you one of his examples.
Yeah. One of the coolest things about Grok 4 is its ability to understand the world and to solve hard problems by leveraging tools, like Tony discussed. One cool example: we asked it to generate a visualization of two black holes colliding. Of course, it took some liberties, and in my case it was actually pretty clear in its thinking trace about what those liberties are. For example, for the waves to actually be visible, you need to really exaggerate their scale. So here it is in action: it exaggerates the scale in multiple ways, and the amplitude drops off a bit less over distance, but you can see the basic effects, which are actually correct. It starts with the inspiral, the black holes merge, and then you have the ringdown. This is largely correct, modulo some simplifications it needed to make, and it's quite explicit about them: it uses post-Newtonian approximations instead of actually computing the general-relativistic effects near the centers of the black holes, which is strictly incorrect and will lead to some inaccurate results, but the overall visualization is basically there. You can also look at the kinds of resources it references. It obviously uses search, gathering results from a bunch of links, but it also reads through an undergraduate text on analytic gravitational-wave models. It reasons quite a bit about the actual constants it should use for a realistic simulation, and it references existing real-world data. It's a pretty good model. And going forward, we can give it the same models physicists use, so it can run the same level of compute that leading physics researchers are using and give you a physics-accurate black hole simulation. And right now, this is just running in your browser. Pretty simple.
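To make the inspiral, merger, and ringdown structure concrete, here is a toy waveform of the kind the demo describes: a Newtonian-order chirp whose amplitude and frequency rise until merger, followed by an exponentially damped ringdown. The constants are arbitrary and chosen for shape only; this is not the post-Newtonian model Grok used.

```python
# Toy gravitational-wave chirp: inspiral (rising frequency and amplitude),
# then ringdown (damped sinusoid). Constants are arbitrary, for shape only.
import numpy as np
import matplotlib.pyplot as plt

t_merge = 1.0                                   # merger time (s), arbitrary
t = np.linspace(0.0, 1.3, 4000)
h = np.zeros_like(t)

insp = t < t_merge
tau = t_merge - t[insp] + 0.01                  # time to merger (regularized)
# Newtonian chirp exponents: amplitude ~ tau^(-1/4), phase ~ tau^(5/8).
h[insp] = tau ** -0.25 * np.sin(2 * np.pi * 30 * tau ** (5 / 8))

ring = ~insp
# Ringdown: damped oscillation at a fixed quasi-normal-mode-like frequency.
h[ring] = h[insp][-1] * np.exp(-(t[ring] - t_merge) / 0.05) \
          * np.cos(2 * np.pi * 250 * (t[ring] - t_merge))

plt.plot(t, h)
plt.xlabel("time (s)"); plt.ylabel("strain h(t) (arb. units)")
plt.title("Toy inspiral-merger-ringdown waveform")
plt.show()
```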
Swapping back real quick, we can take a look: the math problem is finished. Let's look at its thinking trace, and you can see how it went through the problem. I'll be honest with you guys, I don't quite fully understand the math, but I did look at the answer ahead of time, and it came to the correct answer here in the final part. We can also take a look at our World Series prediction. It's still thinking through that one, but we can try some other things in the meantime, like the X integrations we built. We worked very hard on our X tools to build out a really great X experience. So we can ask the model, say, to find the xAI employee that has the weirdest profile photo, and that goes off and starts. Then we can also ask it to create a timeline based on X posts detailing the changes in HLE scores over time, and we can see all the conversation that was taking place at each point: who was announcing scores, and what the reactions were at the time. So we'll let that process, and if we go back, this was the Greg Yang photo. So, Greg Yang, of course, has his favorite photograph on his account. That's actually not what he looks like in real life, by the way, just so you're aware, but it is quite funny. But the model had to understand the question.
Yeah, and that's the wild part: it understands what a weird photo is, what's more or less weird. It has to find all the team members and figure out who we all are. And it's searching without access to any internal xAI personnel data; it's literally just looking at the internet, so you could ask the same of any company.

Exactly. And we can take a look here at the Humanity's Last Exam timeline question. It's still researching all of the historical scores and will have the final answer soon, but while it's finishing up, we can look at one of the runs we set up a second ago. You can see it finds the date Dan Hendrycks initially announced the exam; we can see OpenAI announcing their score back in February, then progress from Gemini, and Kimi as well, and even leaked benchmarks where people are saying that if a rumored score is right, it's going to be pretty impressive. Pretty cool. So I'm looking forward to seeing how everybody uses these tools and gets the most value out of them.

Yeah, and we're going to close the loop around usefulness as well, so it's not just book smart but actually practically smart.
All right, back to the slides. So we also evaluated on the multimodal subset; the earlier number was on the full HLE set. You can see a little dip in the numbers there. This is something we're actively improving, the multimodal understanding capabilities, and I believe that in a very short time we'll get much higher numbers on this benchmark.

Yeah, the biggest weakness of Grok currently is that it's partially blind: its image understanding, and its image generation, need to be a lot better. That is being trained right now. Grok 4 is based on version 6 of our foundation model; we are training version 7, which we'll complete in a few weeks, and that will address the weakness on the vision side.

And to show off this last one: the prediction-market task finished with Grok 4 Heavy, and we can see all the tools and the process it used to find the right answer. It browsed a lot of odds sites, calculated its own odds, and compared them to the market's to find its own alpha and edge. It walks you through the entire process and calculates the most likely winner to be the Dodgers, giving them a 21.6% chance of winning this year. It took approximately four and a half minutes to compute. That's a lot of thinking.
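The "alpha and edge" step reduces to comparing the model's probability estimate with the probability implied by market prices. A minimal sketch with hypothetical numbers (0.216 is the demo's Dodgers estimate; the market odds are made up):

```python
# Compare a model's probability estimate against the market-implied probability.
# Numbers are illustrative: 0.216 is the demo's Dodgers figure; the market
# odds here are hypothetical.
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds (ignoring the bookmaker's margin)."""
    return 1.0 / decimal_odds

model_p = 0.216               # Grok 4 Heavy's estimate from the demo
market_odds = 5.0             # hypothetical decimal odds (implied p = 0.20)

market_p = implied_probability(market_odds)
edge = model_p - market_p     # positive edge: the model thinks the market underprices
print(f"market-implied p = {market_p:.3f}, model p = {model_p:.3f}, "
      f"edge = {edge:+.3f}")
```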
We can also look at all the other benchmarks besides HLE. As it turns out, Grok 4 excels on all the reasoning benchmarks people usually test on, including GPQA, a PhD-level problem set that's easier than HLE; AIME 2025, the American Invitational Mathematics Examination, on which Grok 4 Heavy got a perfect score; the coding benchmark LiveCodeBench; HMMT, the Harvard-MIT Mathematics Tournament; and USAMO. On all of these benchmarks, we often have a very large lead over the second-best model out there.

Yeah, really we're going to get to the point where it gets every answer right on every exam, and where it doesn't get an answer right, it's going to tell you what's wrong with the question. Or, if the question is ambiguous, it will disambiguate the question into interpretations A, B, and C and tell you what the answers to A, B, and C would be. So the only real test then will be reality: can it make useful technologies, discover new science? That will be the only thing left, because human tests will simply not be meaningful.
Yeah, we'll need to update HLE very soon given the current rate of progress. It's super cool to see multiple agents collaborating with each other to solve really challenging problems, and you can try this model right now. If we advance to the next slide: we're introducing a SuperGrok Heavy tier that gives you access to both Grok 4 and Grok 4 Heavy, where you can be the taskmaster of a fleet of little Grok research agents that help you get smarter and save hours of mundane work. It's available right now.

We did limit usage during the demo so it didn't break, because all of this is happening live; nothing about these tests is canned. After the demo is done, we'll enable more SuperGrok subscribers, so if you can't subscribe right now, just try again in half an hour and it should work. And now, let's get into voice.
So, many of you have been enjoying our voice mode, and we've been working hard to improve the experience over the past couple of months. We have cut latency in half to make it much snappier, and today we're excited to announce a set of new voices with exceptional naturalness and prosody. You might have noticed the movie-trailer voice that opened the live stream; that's one of our new voices, S, who's got that epically deep tone. And we're also excited to introduce Eve, a beautiful British voice capable of rich emotion. Aman, would you like to introduce Eve?

Absolutely. Let's get into the demo. Hey, Eve, millions of people are watching us right now. Are you excited?

Oh, I'm thrilled. Really, it's like being on stage at the Old Vic with a sea of faces out there. Just imagine all those eyes and ears tuned in. It's a proper buzz, isn't it? What do you reckon we should chat about to keep this lot entertained?

You know, we have a lot to talk about, but I'll be honest, I'm feeling a little nervous. Can you whisper something soothing to calm me down?

Take a deep breath, love. You've got this. It's just you and me having a quiet chat, like we're tucked away in a cozy corner of a Yorkshire pub. The world's just a murmur out there. Feel that calm wash over you. Let's keep talking, nice and easy. What's on your mind?

You know what, I'm actually feeling quite relaxed, so let's try something crazy. Can you give us an opera about Diet Coke?

Oh, an opera about Diet Coke, that's a riot. Picture this: a shimmering silver can glistening under stage lights. Oh Diet Coke, thou elixir divine, with bubbles that dance in a sparkling line. Thy crisp cool kiss on lips so fine. Oh aspartame, forever mine. How's that for a mad little aria? Want me to belt out another verse or switch up the tune?

Okay, let's try switching it up; just start singing.

Right, then, let's have a go. Oh, silver can, you gleam in the light, Diet Coke, my heart's delight.

Okay, I'll stop it there. It's very, very poetic. Thank you, Eve. I'm feeling quite nice now. Cool. So that's Eve.
Something else we do here is a little test each day to check the natural flow of our conversations against some of our competitors. So we figured, why not try that live on the stream today? Should I start?

Hey, I'm going to say a number, and can you say the same number back to me? One.

One. Got it. This is like a fun little game of echo. What's next?

Two.

Sorry, you said two.

Three.

Three.

Four.

Nice and steady. What's next?

Five.

Five. Nailed it. This is pretty fun, actually. Feel free to keep going or switch it up if you'd like.

Hey Grok.

Hello there. How can I help you today?

We're going to do a little test. Can you repeat after me? One.

One. All right, what's next?

Two.

Two. What's on your mind?

Three.

Three. Need anything else?

Four.

Four. How can I assist you?

Five.

Five. What's next?

So, as you can see, Grok was snappier and didn't interrupt me, and on prosody we've made different design choices. I think we're shooting for something more calm, smooth, and natural, versus something more poppy or artificial. We'll keep improving on these fronts. All right, thanks, guys.
Yep. Since the launch of voice mode, we've seen 2x faster end-to-end latency, five different voices, and 10x the active users, so Grok voice is taking off. Now, with this release we're also making Grok 4 available through the API at the same time. We're very excited about what all the developers out there are going to build. If I think of myself as a developer, the first thing I'd do with access to the Grok 4 API is run benchmarks. So we asked around on the X platform: what is the most challenging benchmark out there, the one considered the holy grail for AGI models? It turns out AGI is in the name: ARC-AGI. Over the last 12 hours, kudos to Greg here in the audience, who answered our call, took a preview of the Grok 4 API, and independently verified Grok 4's performance. Initially we just thought, hey, Grok is pretty good, pretty smart, our next-gen reasoning model that spends 10x more compute and can use all the tools. But it turned out that, verified on the private subset of ARC-AGI-2, it was the only model in the last three months to break the 10% barrier, and in fact it was so good it reached 15.8% accuracy, 2x the second-place model, Claude 4 Opus. And it's not just about performance. When you think about intelligence, where the API model drives your automation, it's also about intelligence per dollar, and if you look at the plots, Grok 4 is just in a league of its own.

All right, enough benchmarks. What can Grok 4 actually do in the real world? We contacted the folks from Andon Labs, who were gracious enough to try Grok in the real world, running a business.

Thanks for having us. I'm Axel from Andon Labs, and I'm Lucas, and we tested Grok 4 on Vending-Bench. Vending-Bench is an AI simulation of a business scenario. We asked: what is the simplest business an AI could possibly run? And we thought: vending machines. In this scenario, Grok and the other models needed to do things like manage inventory, contact suppliers, and set prices. All of these things are individually super easy, and all the models can do them one by one, but when you do them over very long horizons, most models struggle. We keep a leaderboard, and there's a new number one.

Yeah, so we got early access to the Grok 4 API, ran it on Vending-Bench, and saw some really impressive results. It ranks clearly at the number one spot, with double the net worth, which is the measure we use: it's not a percentage or a score, it's the dollar value of net worth you generate. We were impressed by Grok. It was able to formulate a strategy and adhere to that strategy over a long period of time, much longer than the other frontier models we've tested. It managed to run the simulation for double the time and score double the net worth, and it was also really consistent across runs, which is really important when you want to use this in the real world. And as we give more and more power to AI systems in the real world, it's important that we test them in scenarios that either mimic the real world or are the real world itself, because otherwise we fly blind into things that might not be great.
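A long-horizon benchmark like Vending-Bench boils down to a loop: the agent observes the business state, picks an action (restock, reprice, and so on), and is scored by the net worth it accumulates over many simulated days. Here is a toy sketch of such a loop; the simulator dynamics and the rule-based agent are invented for illustration and are not Andon Labs' code.

```python
# Toy Vending-Bench-style loop: score an agent by net worth over a long horizon.
# The simulator and the rule-based agent below are invented for illustration.
import random

def simulate(agent, days=365, seed=0):
    rng = random.Random(seed)
    cash, stock, price = 500.0, 50, 2.0       # starting cash, units, unit price
    for _ in range(days):
        action = agent(cash, stock, price)
        if action == "restock" and cash >= 100:
            cash -= 100; stock += 100          # buy 100 units at $1 each
        elif action == "raise_price":
            price *= 1.05
        elif action == "lower_price":
            price *= 0.95
        demand = max(0, int(rng.gauss(30, 8) * (2.0 / price)))  # price-sensitive
        sold = min(stock, demand)
        stock -= sold; cash += sold * price
    return cash + stock * 1.0                  # net worth = cash + inventory value

def simple_agent(cash, stock, price):
    if stock < 20: return "restock"
    if stock > 80: return "lower_price"
    return "raise_price"

print(f"net worth after a year: ${simulate(simple_agent):.2f}")
```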
Yeah, it's great to see that we've now got a way to pay for all those GPUs. We just need a million vending machines. At $4,700 per machine per year, we could make $4.7 billion a year with a million vending machines.

100%. Let's go.

They're going to be epic vending machines. All right, we're actually going to install vending machines here, a lot of them.

We're happy to supply them.

All right, thank you. I'm looking forward to seeing what amazing things are in these vending machines.

That's for you to decide.
Tell the AI. Okay, sounds good.

So we can see that Grok is able to become the copilot of a business unit. What else can Grok do? We're releasing this Grok right now, if you want to try it and run the same benchmark as us: it's on the API, with a 256k context length. We're already seeing some early adopters trying the Grok 4 API. Our Palo Alto neighbor, the Arc Institute, a leading biomedical research center, is already exploring how to automate its research flows with Grok 4. It turns out Grok 4 can help the scientists sift through millions of experiment logs and pick the best hypothesis in a split second; they're using this for their CRISPR research. Grok 4 was also independently evaluated as the best model at examining chest X-rays, who would have known. And in the financial sector, Grok with access to all the tools and real-time information is one of the most popular AIs out there. Grok is also going to be available on the hyperscalers. The xAI enterprise sector only started two months ago, and we're open for business.
The other thing: we've talked a lot about having Grok make video games. Danny is a video game designer on X, and we asked, hey, who wants to try the Grok 4 preview API to make games? Danny answered the call. This first-person shooter game was made in a span of four hours. One of the unappreciated hardest problems of making video games is not necessarily coding the core logic, but going out to source all the assets, all the texture files, to create a visually appealing game. One of the things Grok 4 does really well with all its tools is automating that asset sourcing, so developers can focus on the core development itself. So now you can run an entire game studio with one person, and have Grok 4 go out, source all those assets, and handle all the maintenance tasks for you.
Yeah. The next step, obviously, is for Grok to be able to play the games. For that it has to have very good video understanding, so it can play the games, interact with them, and actually assess whether a game is fun, with good judgment about whether a game is fun or not. With version 7 of our foundation model, which finishes training this month and will then go through post-training, RL, and so on, we'll have excellent video understanding. With video understanding and improved tool use, for video games you'd want it to use Unreal Engine or Unity or one of the main graphics engines, then generate the art, apply it to a 3D model, and create an executable that someone can run on a PC, a console, or a phone. We expect that to happen probably this year, and if not this year, certainly next year. It's going to be wild. I would expect the first really good AI-made video game next year, probably the first half hour of watchable AI-generated TV this year, and probably the first watchable AI movie next year. Things are really moving at an incredible pace.

Yeah, while Grok is 10x-ing the world economy with vending machines, it'll just create video games for humans.

Yeah. I mean, it went from not being able to do any of this really even six months ago to what you're seeing before you here, and from very primitive a year ago to making a 3D video game with a few hours of prompting.
Yep. To recap, in today's live stream we introduced the most powerful, most intelligent AI model out there, one that can actually reason from first principles using all the tools, do all the research, go on a journey for ten minutes, and come back with the most correct answer for you. It's kind of crazy to think that just five months ago we had Grok 3, and now we already have Grok 4, and we're going to continue to accelerate. As a company, xAI is going to be the fastest-moving AGI company out there. What's coming next: we're going to keep developing models that are not just intelligent, able to think for a really long time and spend a lot of compute, but models that are both fast and smart; that's the core focus. If you think about which applications can really benefit from very intelligent, fast models, coding is one of them.

Yeah. The team is currently working very heavily on coding models. The main focus right now is a specialized coding model we recently trained that is going to be both fast and smart, and I believe we can share that model with all of you in a few weeks.

That's very exciting. And right after coding, we all see that the weakness of Grok 4 is its multimodal capability. In fact, it was so bad that Grok was effectively looking at the world squinting through frosted glass, seeing blurry features and trying to make sense of them. The most immediate improvement we'll see with the next-generation pre-trained model is a step-function improvement in the model's image understanding, video understanding, and audio: the model will be able to hear and see the world just like any of you. With all the tools at its command, and all the other agents it can talk to, we're going to see a huge unlock for many different application layers. After the multimodal agents, what comes next is video generation, and we believe at the end of the day it should just be pixels in, pixels out. Imagine a world with an infinite scroll of content on the X platform, where you can not only watch these generated videos but intervene and create your own adventures.

And we expect to be training our video model with over 100,000 GB200s, and to begin that training within the next three or four weeks. So we're confident it's going to be pretty spectacular in video generation and video understanding.

So, that's it. Anything you guys want to say?

It's a good model, sir. Good model.

Well, we're very excited for you guys to try Grok 4. Thank you. All right, thanks, everyone. Good night.