Cellular Automaton Engine with C++ AMP – Part I
For my first contact with cellular automata I thank to science fiction author Greg Egan and his brilliant novel Permutation City. It is hard-core sci-fi – not like wannabe-scientific fantasy fairy tales which are in form of books and movies everywhere today trying to persuade people that they represent sci-fi genre. At the time of reading it I did not even understand what cellular automata exactly is (in this novel, artificial life simulator was based on it). But later I spent some time trying to understand basics and I found a book which can be without exaggeration considered the bible of cellular automata: New Kind of Science (NKS) by scientist Stephen Wolfram. It is huge book and although I bought it some years ago, I did not manage to read it all so far. It looks at cellular automata from all angles, surprisingly covering many disciplines where the knowledge can be applied. Although this work is controversial and by many not so revolutionary as Wolfram claims, it helped me understand one thing: It is something fundamental. I was always attracted to fundamental things, things which are simple and capable to explain complex phenomena. That’s why I was interested in theoretical physics in the past (“popular science” level only) and that is also why for example I don’t care about mess which is in the field of elementary particles today (I am not interested in Higgs boson. It is just not fundamental enough.)
So what really is cellular automaton (CA)? I am not going to explain what it is, you can simply click on hyperlink to find out. I will say what it means for me: CA for me is an example of how complexity emerges from nothing. Understanding this has interesting implications and great philosophical consequences; it significantly changes how I look at the world, at the questions of our being, life and reasons why we and nature exist in the form we observe. But I will keep this philosophical aspect aside, because it is not important for what I want to do. It is just reason why I think it is such an exciting topic.
For purpose of the following, I can describe CA as discrete deterministic system of simple rules iteratively applied to n-dimensional space. In each step we apply the rules to find out the state of each cell of the space (usually cell state is 1 bit – “On” or “Off”). Inputs for the rules are states of the cell and neighbouring cells from previous iteration. I am going to experiment with 1-dimensional CA as Wolfram does in NKS. It maybe does not look so sexy as for example two-dimensional Conway’s Game of Life or CA in more dimensions, but it is fundamental and reduced to its essence. From now on I am talking about Elementary cellular automata only.
Programming massively parallel non-blocking programs
At first something about my motivation. I work as a software developer in Microsoft ecosystem (.NET, C#) currently specializing to GUI (Silverlight) development. Although it is challenging and definitely not boring, there is something lacking in it. I do not want to completely reduce it to simplification that what most developers today work on is just “screen, form, database” and variations on it. I would borrow words of a friend who said it is a craftsmanship. Creative and exciting, but craftsmanship. It is engineering work with less and less computational problems rather more focused on integration and orchestration of tens of internal and 3rd party components and subsystems. Where is pure computer science?
Therefore for balancing what I work on professionally, I want to widen my horizons at home. What interests me now is pure computation, parallelism (with minimum blocking) also with less abstractions and 3rd party components, closer to the “iron”. 100% utilization of resources by efficient algorithms.
Next take on GPU computing: C++ AMP
We live in such an exciting time. We recently realized capabilities of GPUs for different applications. Although identifying data-parallel problems and implementing them requires paradigm shift, taking advantage of GPUs is more and more common. But still, general purpose computing on GPU (GPGU) is in its early stages. What technology / stack to pick? CUDA which is I think the most mature architecture with amazing tooling but it is NVidia specific? Or OpenCL which is for heterogeneous platforms, with similarities to OpenGL? Or DirectCompute which is heterogeneous only from hardware perspective? With all of them requiring to use special language (C for CUDA, GLSL, HLSL) for shaders. Or something on top of mentioned adding another abstraction layer?
I decided I will go for something new: C++ AMP. It is new, not yet finally released technology from Microsoft. Their ambition is standardize it as part of C++ specification. I do not know how successful it will be and it is possible it will be forgotten. But I hope not. I am quite excited about it.
What I like it does not force you to use any special language. It is part of C++ (C# would be nice as well but I admit starting with more low-level general purpose language is logical). It allows you to annotate methods or lambdas with the new restrict keyword, which is used to tell compiler where the method can be executed. Your regular C++ program can contain blocks which will be executed on accelerator (GPU) with advantage of compile time checking for restrictions of target hardware (no writing to console on GPU ;-)).
It is built on top of DirectCompute (compiler generates HLSL for you and embeds it into the executable), but wait – it is future proof. DirectCompute is only current implementation detail and for example NVidia can provide CUDA implementation later. And when sometime in future GPU vendors come with unified interface to GPUs, this abstraction layer can be taken away completely.
There are other reasons to be excited about AMP, for example debugging experience is awesome. I highly recommend this video to everyone who is interested in this technology.
There is one big drawback. Early adoption of something, what does not have even stable release will be pain. I know it. I realize that this is not the most comfortable way to pick (CUDA is), but learning something new which can change GPGU field and make it really platform independent and “general purpose”, is worth it.
And I almost forgot: Why to use GPU for CA? Because it is data parallel problem (same algorithm applied to lot of data without need for synchronization). (We will see that it is not that simple and some synchronization is required).
First (naive) implementation and lessons learned
Objective of this first exercise was modest: running simplified CA algorithm and making the GPU sweat. It was not important whether it is efficient now, this was to try out AMP and discover concepts, limitations and possibilities of it. I also wanted to hear fans of my GPU (without playing a game :-)).
Program is simulating 1-dimensional CA, space length is fixed. Pure simulation, no visualisation.
Algorithm exposes following interface (C#)…
BitArray Run(int spaceLength, int ruleNumber, int iterations, int initialTrueIndexes)
…where spaceLength parameter is number of cells, ruleNumber parameter allows you to specify rules of CA using Wolfram’s numbering scheme, iterations is number of iterations for the automaton. Instead of providing initial state for all cells of CA, initialTrueIndexes allows you to specify which cells are initially in “On” state (usually one cell). Return value represents the whole space after the CA simulation.
I exposed functionality as C# library which is wrapper around C++ library. For this I used approach described here (see also here). I had to refresh my C++ knowledge (including bit operations). The biggest problem I faced was missing outer brackets in #define directives what caused me lot of pain (yes, lame mistake).
CA “engine” is implemented in Engine.cpp file (whole solution is included in this blog post). I am not going to discuss it in big detail, because I realized it is kind of throw-away prototype. It has various flaws which I will describe.
I decided to do data parallelism on cell level with 1-bit granularity, that means 1 thread per each cell (which is one bit). Obviously, bit is not addressable, so I could use int for one cell instead (byte is not supported on GPU), but it looked like huge inefficiency (it would use 32x more memory than it essentially needs). And running 32 threads on the same data (one int) is basically crazy idea, because of required synchronization. So what I have done is I decreased level of parallelism by simulating 32 cells by one GPU thread which computes their states serially. This decrease of parallelism was mistake I will not repeat, because massive parallelism is what I wanted to gain from GPU originally.
Another rather fundamental flaw was that I was designing it with the assumption that it will be whole executed on GPU and CPU will simply wait until GPU runs required number of iterations. I did not realize there is an synchronization issue. Because state of cell depends not only on its previous state but also previous states of its neighbours, there has to be some barrier which will cause threads to wait. Global barrier for the whole space is logically not required and inefficient too (and I do not even know whether AMP / GPU supports something like that). Another option is barrier on the tile but that is not possible because CA space cannot be split into subspaces which are independent. There are other reasons why this was wrong approach. Long running GPU kernel (> 2 sec.) will cause graphics driver to restart, so apparently this was not supported use case (so no, I did not make GPU fan noisy). Other future features like read-back of space during the simulation, expansible and sparse space, all of those will require CPU, because of GPU limitations I learned.
Current naive implementation does not contain synchronization and therefore it can theoretically return unpredictable results. I created something which is not correct and has flaws, but it pointed me to new directions, stimulated me to discuss it with others and to think about something better. But it essentially runs CA and I learned a lot about not only AMP but GPGU paradigm generally. Objective partially met.
Next and the proper solution will consist of part executed on CPU and part executed on GPU. CPU will intelligently dispatch subspaces to GPU for computation of partial results. It will manage expansion of space and synchronize interdependencies between subspaces. Only what is required will be computed (number of dispatched subspaces will grow with increase of cells in “On” state). It will allow to read-back the space during the simulation (later maybe also visualize it). I will also consider creating general n-dimensional engine. Because of the high cost of CPU memory <-> GPU memory copying, it will not be trivial and has yet to be analysed properly.
Since it will be tight cooperation between CPU and GPU executed code, I am planning to implement both parts in C++. Wrapping GPU code by managed wrapper just adds unnecessary complication. I will get rid of C# completely.
To be continued… I am looking forward but I suspect it won’t be anytime soon :-(