
The LLM experiment

December 24, 2025

First of all, I am not an academic genius.

I do not read every research paper and thesis on AI, so this concept may well duplicate something that already exists, or something that the researchers who are really in the trenches have been talking about for years.

I started the post with this statement so that some rando won't call me out for being a fool and copying, word for word, something I have never heard of (with this same exact concept; god forbid). Anyways:

What is The LLM Experiment

I have had this idea for quite some time and finally decided to get it down "on paper".

The singularity is the point at which AI can improve itself, leading to ever-better models and technological prosperity beyond our imagination.

Measuring progress toward this point is an immense challenge, not something easy at all. Here is an experiment that I think could test it.

How it works:

The goal is to take a third-party LLM, for example Claude Opus 4.5 (we will refer to this as the base model), and give the model the tools and structure to start building its own LLM (we will refer to this as the child model). For example, if I worked with Claude Code, I could probably develop a POC of a super simple LLM. BUT there is a human in the loop: reporting back the errors, doing tasks, launching scripts, etc.

How can we give the agent the structure to control itself from the start? By giving the base model the tools and structure to develop a child model completely from scratch. This would include everything: the base model finds a way to scrape the training data, does all the math and coding and everything in between, controls the compute infrastructure, handles version control, you name it.

To start with, humans have to build the development tools that the base model needs to develop the child model, inside the actual repo where the target code will live. What will the base model need? It will need tools to read, create, and edit files. It will need tools to spawn terminals so that it can execute code, install dependencies, and so on. It will need tools to set up and manage cloud infrastructure. It will also need tools to manage its own memory: even a context length of one million tokens is nothing for a project like this, so it will need infrastructure to launch subagents for different tasks, keep todo lists over short and long horizons, and so on, to actually make progress despite the limitations of the model.
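To make that concrete, here is a rough sketch (in Python) of what the thinnest possible version of that human-built tool layer could look like. Everything in it, the repo path, the tool names, the dispatch format, is made up for illustration; a real version would also need the cloud, memory, and subagent tooling described above.

```python
# Hypothetical sketch of the human-built tool layer the base model would use.
# The repo path, tool names, and call format are placeholders, not a real API.
import subprocess
from pathlib import Path

REPO = Path("child-model-repo")  # hypothetical repo where the child model is built

def read_file(path: str) -> str:
    """Return the contents of a file inside the repo."""
    return (REPO / path).read_text()

def write_file(path: str, content: str) -> str:
    """Create or overwrite a file inside the repo."""
    target = REPO / path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def run_command(command: str) -> str:
    """Run a shell command in the repo and return its combined output."""
    result = subprocess.run(
        command, shell=True, cwd=REPO, capture_output=True, text=True, timeout=600
    )
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "write_file": write_file, "run_command": run_command}

def dispatch(tool_call: dict) -> str:
    """Execute one tool call of the form {"tool": name, "args": {...}}."""
    return TOOLS[tool_call["tool"]](**tool_call["args"])

# Example of a tool call the base model might emit:
# dispatch({"tool": "write_file",
#           "args": {"path": "train.py", "content": "print('hello')"}})
```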

This is very important: the whole thesis rests on humans developing the base tools.

The base model gets access to the codebase where it is supposed to build the child model, and that codebase also includes all of the base tools that humans wrote. So if the base model identifies the need for, say, better documentation tooling, it should be able to develop that tooling itself. The end goal is for the child model to equal or surpass the quality of the base model (where the base model is always the best publicly accessible model). The first major step would be getting to the point where humans no longer have to assist the base model in developing the tools or guiding the general direction; that alone is probably a while away. Then, if we ever reach the point where child model >= base model, we have reached, or are at the doorstep of, the singularity.
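For what it's worth, the "child model >= base model" condition is easy to state in code, even though defining the benchmark is the hard part. A minimal sketch, assuming both models are reduced to an answer function and scored on the same held-out set; the exact-match scoring and the benchmark format are toy placeholders, and a real comparison would need broad suites across coding, math, and reasoning.

```python
# Hypothetical sketch of the end condition: the child matches or beats the base
# on a shared benchmark. Scoring and benchmark format are placeholders.

def exact_match_score(model_answer: str, reference: str) -> float:
    """Crude scoring: 1.0 if the answer matches the reference exactly, else 0.0."""
    return float(model_answer.strip() == reference.strip())

def evaluate(answer_fn, benchmark: list[tuple[str, str]]) -> float:
    """Average score of a model (given as an answer function) over the benchmark."""
    scores = [exact_match_score(answer_fn(question), reference)
              for question, reference in benchmark]
    return sum(scores) / len(scores)

def singularity_check(base_answer_fn, child_answer_fn, benchmark) -> bool:
    """The experiment's end condition: child model >= base model."""
    return evaluate(child_answer_fn, benchmark) >= evaluate(base_answer_fn, benchmark)
```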

One of the key requirements here:

Humans never interfere with the actual development of the AI (though reverting the AI's work will realistically have to be done by humans). Humans are only allowed to improve the model indirectly: by improving the tooling around it, improving the base prompts, and so on. We are not allowed to tell it what to do.

This project would of course be open source. One of the interesting problems on the human side is that it would cost a lot of money to conduct, likely millions of dollars. Compute for training an LLM is not cheap, and it is even harder when the model can only rely on cloud infrastructure, which drives the costs up further. So we humans would have to develop a donation system, probably built on PayPal or Stripe, where people can send money to support the project, plus a clever way for the AI to set budgets, control spending, and all that. There would also have to be a monitoring layer where humans can see every action the AI takes. This is where we would have to be careful not to dox the payment side.
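Here is a hypothetical sketch of that budget and monitoring layer: a ledger with a hard spending cap that every spend request has to pass through, and an append-only audit log that humans can read. The cap, the log format, and the names are made up for illustration; a real version would sit between the agent and the actual cloud and payment APIs.

```python
# Hypothetical budget guardrail and audit log. Cap, file path, and field names
# are placeholders; a real system would wrap the actual cloud/payment calls.
import json
import time
from pathlib import Path

AUDIT_LOG = Path("audit.log.jsonl")   # human-readable record of every action
HARD_CAP_USD = 10_000.0               # placeholder spending limit

class BudgetExceeded(Exception):
    pass

def log_action(kind: str, details: dict) -> None:
    """Append one action to the audit log that humans monitor."""
    entry = {"time": time.time(), "kind": kind, "details": details}
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

class Ledger:
    def __init__(self, cap_usd: float = HARD_CAP_USD):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def authorize(self, amount_usd: float, purpose: str) -> None:
        """Approve a spend if it fits under the cap, and log it either way."""
        allowed = self.spent_usd + amount_usd <= self.cap_usd
        log_action("spend_request", {
            "amount_usd": amount_usd, "purpose": purpose, "allowed": allowed,
        })
        if not allowed:
            raise BudgetExceeded(f"{purpose} would exceed the {self.cap_usd} USD cap")
        self.spent_usd += amount_usd

# Example: ledger = Ledger(); ledger.authorize(250.0, "GPU hours for pretraining run")
```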

Key Clarifications:

"This will never work and it will waste money for years before any meaningful result is produced." Well, that is kind of the point. We are not expecting it to work with the current base models. The definition of AGI is that we have reached human level intelligence across all domains. People push back on this experiment saying "AI will never be able to do the base math, come up with the training data yadayada." If we have reached AGI, the models SHOULD be able to do what is described in this experiment. Failing is expected and just as interesting as succeeding. The point is not for it to succeed, the point is that it is a metric on progress.

"If the child model ever gets to the level of Claude 4.5 Opus, Claude 5, 6 or the most recent human produced models will outdo that." Human labs are limited by the constraint of time. A human researcher only has 24 hours in a day. If we ever get to the point where the child model is at the level of Opus 4.5, we can scale up the agents infinitely and we have basically reached the point of singularity.