# Evolution of the Linux kernel source code tarball size

10 November 2011

Here is a graph showing how the size of the linux.tar.bz2 source code packages has evolved. It starts with version 1.0 and ends with 3.1. The growth is mostly exponential, so we could try to predict that linux-3.19.tar.bz2 should be around 100MB.

You can grab the data I used and the gnuplot file for generating the graph.

[...] ????? ????????? ?????????? ???????????? ????? ??????? ???? ???? Linux, ??????????? ?? ?????? [...]

Does this mean that Linux is getting bloated?

http://www.theregister.co.uk/2009/09/22/linus_torvalds_linux_bloated_huge/

“Citing an internal Intel study that tracked kernel releases, Bottomley said Linux performance had dropped about two percentage points at every release, for a cumulative drop of about 12 per cent over the last ten releases. “Is this a problem?” he asked.

“We’re getting bloated and huge. Yes, it’s a problem,” said Torvalds.”

And the next question: what should we do about this problem?

Nice plot. But, just out of curiosity:

Why do you fit a, b, c in a function f(x)=a*b**(x/c)+1, which is f(x) = 1+a*exp(x*ln(b)/c)? In other words, your fitted function equals f(x) = 1 + 0.175822*exp(x*0.02011), which uses only two parameters.
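The redundancy pointed out here can be checked numerically. The sketch below uses the fitted values quoted in this comment. It assumes sizes are in MB, and x is whatever release index the data file uses, which the post does not spell out. The two (b, c) pairs are hypothetical, chosen only so that they share the same ratio ln(b)/c.

```python
import math

# Fitted values quoted in the comment above (sizes assumed to be in MB).
A = 0.175822
RATE = 0.02011  # plays the role of ln(b)/c in the three-parameter form


def f_three(x, a, b, c):
    """Three-parameter form: a * b**(x/c) + 1."""
    return a * b ** (x / c) + 1


def f_two(x, a, k):
    """Equivalent two-parameter form: 1 + a*exp(k*x), with k = ln(b)/c."""
    return 1 + a * math.exp(k * x)


# Two hypothetical (b, c) pairs with the same ratio ln(b)/c = RATE.
pairs = [(math.exp(RATE), 1.0), (math.exp(10 * RATE), 10.0)]

for x in (0, 100, 300):
    three = [f_three(x, A, b, c) for b, c in pairs]
    print(x, three, f_two(x, A, RATE))

# Index at which the fitted curve reaches 100: solve 1 + A*exp(RATE*x) = 100.
x_100 = math.log((100 - 1) / A) / RATE
print(x_100)
```

Both pairs print the same values as the two-parameter form, so b and c cannot be pinned down separately by the fit. Only their combination ln(b)/c matters.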

@Stefan Huber

Well, I tried, but gnuplot failed to fit the data with the function f(x)=a*exp(b*x)+1, whereas it succeeded with a third parameter. I could have tried fitting with other software, but it doesn’t matter much, and gnuplot is widely available, so people can quickly check for themselves.

@Michael

The kernel size is increasing, and that’s absolutely normal. Is it bloated and getting slower? These data say nothing about that. You should ask Linus :-)

Looks like you are writing in Russian and WordPress doesn’t like it :-(

Since 3.8 will become 4.0, the 100 MB will only be reached by version 5.3 ;)

It would be interesting to add the release dates to the data file, and then make a graph as a function of time.

Hi,

It would be interesting to do the same with all the hardware support removed (arch and drivers, which is where most of the bloat is likely to come from)…

Thanks for the nice graph.

François.

It would be more representative if the independent variable (x) were the time (in days or weeks) of the kernel release and not its version number, which is an arbitrary (albeit monotonically increasing) label.

Perhaps the monolithic kernel should become more modular. Drivers, arch, etc. could be contained in a modular data file or database. Once a machine configuration is probed and analyzed, the necessary modules can be attached to that system, thus avoiding the bloat creep that comes with a monolithic kernel.

Keep the core system tight and pretty. Storage is cheap nowadays, and drivers, arch and all that could be kept offline, so nothing is really “Derezzed” and put out on the game grid.

Just FYI concerning the fitting problem. As you said, fitting f(x) = 1+a*exp(b*x) does not succeed. However, fitting f(x) = 1+a*b**x does and I get a=0.175663 and b=1.02032.

If you want to fit an exp(.) function, you need to help gnuplot a bit with proper starting values. Setting a=1 and b=0.2 lets gnuplot succeed in fitting f(x)=1+a*exp(b*x), and the results are a=0.176095 and b=0.0201094.
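As an aside, one possible way to come up with starting values is to log-linearize the model: if y = 1 + a*exp(k*x), then ln(y - 1) is a straight line in x. This is not what was done in gnuplot above, just a swapped-in technique. The sketch below runs on synthetic data generated from the a*b**x fit quoted earlier, not on the real tarball sizes, and the helper name `loglinear_start` is made up for illustration.

```python
import math

# Synthetic stand-in data generated from the quoted a*b**x fit.
# These are NOT the real tarball sizes.
A_TRUE, B_TRUE = 0.175663, 1.02032
xs = list(range(0, 300, 10))
ys = [1 + A_TRUE * B_TRUE ** x for x in xs]


def loglinear_start(xs, ys, offset=1.0):
    """Least-squares line through (x, ln(y - offset)).

    Returns (a, k) such that y is approximately offset + a*exp(k*x).
    Points with y <= offset are skipped because their log is undefined.
    """
    pts = [(x, math.log(y - offset)) for x, y in zip(xs, ys) if y > offset]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_z = sum(z for _, z in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in pts)
    k = sxz / sxx                         # slope = exponential rate
    a = math.exp(mean_z - k * mean_x)     # intercept = ln(a)
    return a, k


a0, k0 = loglinear_start(xs, ys)
print(a0, k0, math.log(B_TRUE))
```

On this noise-free data the recovered rate matches ln(b), roughly 0.0201, which is in line with the exp(.) fit reported above. With real, noisy data the estimate would only be a rough starting point for the nonlinear fit.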

The gnuplot interactive help on “fit” (subtopic tips) has some further remarks.

Best regards and thanks for the data.

Doesn’t the source code size tell us mostly what is available?

With this chart we also need disk sizes and memory footprints of built kernels: popular ones like Ubuntu and Fedora, and custom ones like Gentoo. Sure, there is feature creep over the years, but there is also a dramatic increase in supported hardware. I build a monolithic kernel with just what I need in it and only one loaded module, the graphics driver. With mostly the same config, my compressed kernel image has recently gone from 2.6MB to 4.7MB. I don’t really have a good way to say what the memory footprint is.

The Linux kernel is what you “Make” it. :-)

I like it that you used gnuplot. I’ve been using gnuplot since my first Linux install (1995?) and still use it.

Yup, the source code is getting more and more extensive. That’s mostly because, while supporting the latest technology available to both home users & enterprise businesses, Linux continues to support legacy hardware as well.

Since this is only the source code, it does not necessarily mean the resulting kernel is growing and growing as well.

I suppose the kernel community has considered what Dave Svenson suggested.

Would kernel modularity help distros to produce modular start-up images so that they would fit on CDs?

Or are there technical or practical reasons why this idea has not been realised?

Thank you for the graph!

[...] explains Jérôme Pinot on the LKML list and in his blog post, the growth is mostly “exponential”, and if it continues at the same pace, the [...]

@Benedict Verhegghe

Yes, it would be interesting too, but it leads to several problems. First, finding and gathering good data about release dates. Not sure you can trust any mirror about this. Second, development of branches occurred in parallel for a long time (like 2.4/2.6). So if you want to plot kernel size against time, you have to do several graphs, one for each branch. It does not give you the same big picture.

@François Guerraz

Well, sure, there is more than one way to do this, I will think about it :-)

@Guillermo Burastero

See

Interview of Linus Torvalds:

Linux has become too complex

The first Linux version had 10,000 lines of code. Today we have around 15 million lines

http://www.zeit.de/digital/internet/2011-11/linux-thorvalds-interview

English translation:

http://translate.google.com/translate?sl=de&tl=en&js=n&prev=_t&hl=fr&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fwww.zeit.de%2Fdigital%2Finternet%2F2011-11%2Flinux-thorvalds-interview

@Stefan Huber

Thanks for the tip!

Thank you Jo Last, modularity is the way to go. Programming with this methodology seems to be a lost art, but I remember from my old computer science classes back in the 80s that this was one of the methods all formally accredited programmers were to be taught.

This philosophy allows various decentralized teams to work on a common project and modify it with relative ease, as long as documentation within the code is well developed.

The structure is an encapsulated black box: a procedure or function whose inputs and outputs are well defined. Essentially, your software is a series of Lego blocks {Kernel}. With a well-developed modular program, you can pull modules in and out at will, but the core system will continue to function.

Refs:

Modular programming – http://en.wikipedia.org/wiki/Modular_programming

Benefits of modular programming – http://netbeans.org/project_downloads/usersguide/rcp-book-ch2.pdf

How to use Modular Programming example – http://markalexanderbain.suite101.com/how-to-use-modular-programming-a139055

[...] explains Jérôme Pinot on the LKML list and in his blog post, the growth is mostly “exponential”, and if it continues at the same pace, the kernel [...]

I end up with a completely different graph: http://janvandermeiden.blogspot.com/2011/11/linux-kernel-size.html

When looking at data from http://ftp.dei.uc.pt/pub/linux/kernel/…

A simple exponential is probably not the best choice for modelling growth in this case. Kernel growth is no doubt more akin to biological growth in the sense that there is an asymptotic approach to some maximum size – clearly the kernel cannot continue to grow at the rate predicted by the simple exponential. My choice would be a Boltzmann function:

f(x)=A2 + (A1-A2)/(1+exp((x-xh)/s)), where

A1 and A2 are the beginning and ending asymptotes, xh is the value of x where f(x) is exactly halfway from A1 to A2, and s is the “steepness” of the transition. This function has a typical exponential character during the initial (lag) and growth phases, but then has an asymptotic upper bound. In fact, your graph is showing a hint of flattening out at the top.
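For readers who want to play with this function, here is a minimal sketch of it. The parameter values are hypothetical and chosen only to make the three properties described above easy to see. They are not fitted to the kernel data.

```python
import math


def boltzmann(x, a1, a2, xh, s):
    """f(x) = A2 + (A1 - A2) / (1 + exp((x - xh) / s))."""
    return a2 + (a1 - a2) / (1 + math.exp((x - xh) / s))


# Hypothetical parameters, for illustration only.
A1, A2, XH, S = 1.0, 80.0, 25.0, 2.0

mid = boltzmann(XH, A1, A2, XH, S)            # halfway between A1 and A2
low = boltzmann(XH - 20 * S, A1, A2, XH, S)   # far left: close to A1
high = boltzmann(XH + 20 * S, A1, A2, XH, S)  # far right: close to A2
print(mid, low, high)
```

At x = xh the exponential equals 1, so the function sits exactly at (A1 + A2)/2. Far to the left it flattens toward A1, and far to the right toward A2.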

Boltzmann Follow up:

I did a quick fit using a Boltzmann and here are the parameters:

R^2 = 0.9857

A1 = 1.2246e+00 +/- 1.2314e-01

A2 = 6.9598e+06 +/- 3.2498e+10

Xh = 8.6267e+02 +/- 2.2929e+05

S = 4.9103e+01 +/- 1.2892e+00

The fit is as good as or better than the simple exponential (note the R^2 value). However, there are a few observations worth making. Aside from the fact that it predicts version 8.6 of the kernel to be the point of half maximum code size, and also predicts a 7 million megabyte asymptote, the standard errors clearly indicate that there is something drastically wrong with these predictions. It could be the way the x-axis is constructed, or something else fundamentally wrong with a predictive model that looks at version numbers vs code size. It could also be that there is simply not enough data at the high end of the curve to estimate the asymptote. Still, it’s fun to try and predict.

Another Boltzmann follow up:

OK, I took my own suggestion and used a different function of the version number for the x-axis. The result is a much more “jumpy” graph, but a much better fit in terms of standard errors of the parameters. The x-axis values were computed from the version number by taking 10 times the major version and adding the minor version. To this I add the revision number divided by 1000. Thus, a version like 2.3.35 becomes 23.035. The division by 1000 is needed to keep the revisions monotonic (i.e., otherwise .100 is less than .12, which is clearly incorrect in terms of version sequence). It may also tend to undervalue the effect of a revision on code size, contributing to the “jumpiness” of the resulting graph. Using these values as the x-axis, I now get these fit parameters:
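The version-to-x mapping described above can be sketched as follows. The function name `version_to_x` is made up for illustration, and treating two-part versions such as 3.1 as revision 0 is an assumption. Four-part versions (like 2.6.x.y) are not handled, since the comment does not say how they were treated.

```python
def version_to_x(version):
    """Map 'major.minor[.revision]' to 10*major + minor + revision/1000.

    Assumption: a missing revision (e.g. '3.1') counts as 0.
    """
    parts = [int(p) for p in version.split(".")]
    major, minor = parts[0], parts[1]
    revision = parts[2] if len(parts) > 2 else 0
    return 10 * major + minor + revision / 1000


print(version_to_x("2.3.35"))

# Dividing by 1000 keeps revision 100 after revision 12 within a series.
versions = ["2.3.12", "2.3.100", "2.4.0"]
xs = [version_to_x(v) for v in versions]
print(xs == sorted(xs))
```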

R^2 = 0.9133

A1 = 2.2963e+00 +/- 8.0424e-02

A2 = 7.7939e+01 +/- 9.1966e-01

Xh = 2.5485e+01 +/- 4.7143e-02

S = 1.7609e+00 +/- 2.1751e-02

Note the much better standard errors. They are small enough that I even have some confidence in the predictions – half asymptote code size at version 2.5.48, predicted max. code size in the 80 MB range. With a few more data points added for future kernel releases, this may actually become predictive.
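To get a feel for what this second fit implies, the sketch below plugs the parameters reported above into the Boltzmann function. It assumes sizes are in MB and that 3.1 maps to x = 31.0 under the scheme described in this comment (revision taken as 0).

```python
import math

# Second-fit parameters as reported above (sizes assumed to be in MB).
A1, A2, XH, S = 2.2963, 77.939, 25.485, 1.7609


def boltzmann(x):
    """Boltzmann curve with the reported second-fit parameters."""
    return A2 + (A1 - A2) / (1 + math.exp((x - XH) / S))


half = boltzmann(XH)       # value at the half-way point x = Xh
at_3_1 = boltzmann(31.0)   # assumed x for version 3.1
print(half, at_3_1, at_3_1 / A2)
```

The last printed number is the fraction of the fitted asymptote already reached at 3.1. A value close to 1 means this model sees the curve as nearly saturated, which is worth keeping in mind given how few data points sit at the high end.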