How Sh Works

From ShWiki

Prologue

This article provides an introduction to how Sh is built and why Sh works the way it does. We introduce the ideas behind Sh step by step, by showing how a similar language built inside C++ might evolve. It is crucial that you study the code examples in this article carefully as you read.

Sh is a C++ library. It acts as a language that allows you to conveniently program coprocessors such as Graphics Processing Units (GPUs) from within C++ applications.

The concept of a language embedded in C++ (or in other languages in general) for a specific purpose is not new. Programming language theorists call such a language an embedded domain-specific language (EDSL for short). Most other C++ EDSLs are based on template metaprogramming, which uses some advanced C++ language features to build a language that is compiled alongside the C++ code. The result is code embedded inside C++ programs that can be very efficient or type-safe. For example, the blitz++ library allows one to specify matrix computations within C++ that execute very efficiently.

There are some important differences between Sh and EDSLs such as blitz++. For example, Sh does not use template metaprogramming and Sh programs compile at runtime rather than C++ compile time. Sh programs also don't have to run on the same processor as the C++ application they are embedded in (although they can do so).

A first attempt

So how does Sh work? Well, let's suppose we had read the above three paragraphs and wanted to rewrite Sh without knowing anything else about it. Suppose we wanted our implementation to multiply an array of a thousand floating-point numbers by a C++ variable (a tiny example of the sort of thing you can do with Sh). After a bit of pondering, perhaps we'd start with something like this:

( 1) Program program = Program(
( 2)   "input_float a;\n"
( 3)   "float b;\n"
( 4)   "output_float c;\n"
( 5)   "c = a * b;\n");
( 6) 
( 7) program.compile();
( 8) float a[1000], b, c[1000];
( 9) program.set_input("a", a);
(10) program.set_variable("b", b);
(11) program.set_output("c", c);
(12) program.execute(1000); // run on a thousand floats

What have we done here? We've built a language inside of C++, although just barely. We'd expect the compile function to parse the code we've passed in, the set_input, set_variable and set_output functions to tie our embedded program's variables to our C++ variables, and the execute function to run the program (e.g. on the GPU) and put the results back into the C++ program.

This is a start, but not a very good one. For one thing, there's a lot of overhead in the above code. Lines 9 through 11 seem superfluous -- wouldn't it be much better to just share variables between the embedded language and C++? Also, there's a lot of work hiding in that little compile() function. It has to parse a language that we're defining from scratch, and if we want to add features to our language we need to implement them by hand. And since we're already programming in C++, we'd really love to use all of those fancy C++ features we've grown to love. Arrays. Functions. Classes. Inheritance. Templates. Implementing all of that would be the same amount of work as implementing a C++ compiler. That's the kind of workload that makes you want to crawl back into bed.

A questionable improvement

So let's think about how we can improve on this. Let's tackle the parsing problem first. Passing the program down as text gives us a lot of flexibility in terms of syntax, but it is hard to implement, and it makes it awkward to generate variations on the programs we're defining. So here's another attempt:

( 1) Program program;
( 2) Input input = program.add_input();
( 3) Output output = program.add_output();
( 4) Variable var = program.add_var();
( 5) program.add_computation(MULTIPLY, output, input, var);
( 6) program.compile();
( 7) float a[1000], b, c[1000];
( 8) input.attach(a);
( 9) var.attach(b);
(10) output.attach(c);
(11) program.execute(1000); // run on a thousand floats

"Hang on," I hear you say, "that looks a lot worse than the previous program!". Yes, in this version we've given up our nice text-based syntax for a rather cumbersome-looking manual way of specifying the program. But notice that we've gained a few advantages over the previous approach. We no longer need to parse a textual syntax in our implementation: the compile() function now simply generates code from the operations we specified. We also gain the flexibility to modify the program with C++ code. For example, by replacing line 5 with:

program.add_computation(MULTIPLY, output, input, var);
for (int i = 0; i < 9; i++) {
  program.add_computation(MULTIPLY, output, output, var);
}

we can generate a program that multiplies input by var ten times instead of just once, without doing any annoying text-based manipulation. And if the above code compiles as C++, it won't fail to compile in our little language, at least not due to syntax errors.

Bring back the operators

So how can we make our syntax resemble the original program again? Well, Input, Output and Variable are classes. Classes in C++ support operator overloading. If we make the program we're defining globally accessible, we can do something like:

Variable operator*(Variable a, Variable b) {
  Variable t = global_program->add_variable();
  global_program->add_computation(MULTIPLY, t, a, b);
  return t;
}

In fact, if we just have the Variable (and Input and Output) constructors add themselves to the global_program, we don't need to use add_variable either. So our code snippet becomes:

( 1) Program program;
( 2) global_program = &program;
( 3) Input input;
( 4) Output output;
( 5) Variable var;
( 6) output = input * var;
( 7) global_program = NULL;
( 8) program.compile();
( 9) float a[1000], b, c[1000];
(10) input.attach(a);
(11) var.attach(b);
(12) output.attach(c);
(13) program.execute(1000); // run on a thousand floats

Wow, that's a lot better. Notice how our code suddenly looks like normal C++ code. But we know that, really, the operator* call in line 6 doesn't actually multiply anything. It just records the fact that a multiply is supposed to happen at this point in the program. And note that we can still do something like:

output = input * var;
for (int i = 0; i < 9; i++) {
  output = output * var;
}

That's a lot more readable than the previous try; it's clear from the code what it does. Remember that the program generated from those few lines contains ten multiplies. Our operator* just sees itself being called ten times; it doesn't know that there's a for loop around it. This example is simple, but this is really powerful stuff. I'll explain the ramifications of this later.

Learn to share

Back to improving our language. Just like the previous version, our variable types really just represent variables in the program; they don't have values attached to them. But why not? For input and output it might not make sense to attach values -- after all, they really stand for a thousand values each, maybe even more, and we don't want to cart all of those values around. But var is just equivalent to a single float. So why do we need two variables for it, one in our little language and one in C++? Can't we just unify the two? The answer, of course, is yes, we can. We just need to distinguish whether a variable is declared inside or outside a program definition.

( 1) global_program = NULL;
( 2) Variable b = 0.5; // Note: b declared with no global_program.
( 3) Program program;
( 4) global_program = &program;
( 5)   Input input;
( 6)   Output output;
( 7)   output = input * b;
( 8) global_program = NULL;
( 9) program.compile();
(10) float a[1000], c[1000];
(11) input.attach(a);
(12) output.attach(c);
(13) program.execute(1000); // run on a thousand floats

What's happened here? Well, we've made our Variable constructor a bit smarter. It recognises when it's declared outside of a program, and if so it allocates normal C++ storage for its value and allows itself to be assigned normal C++ values. By using "b" in our program on line 7, we've implicitly made a connection between our C++ program and our embedded program. As long as we define appropriate assignment operators, any assignment to such a variable will also update its value on whatever processor the embedded code runs on. Cool!

Also, if we're going to support multiplication and all the other math operators on our types, why not use them outside of embedded programs too? All we have to do is change our operator*:

Variable operator*(Variable a, Variable b) {
  if (global_program != NULL) {
    Variable t = global_program->add_variable();
    global_program->add_computation(MULTIPLY, t, a, b);
    return t;
  } else {
    Variable t;
    t.value() = a.value() * b.value();
    return t;
  }
}

Now we can use our Variable types in "immediate mode" as well, to do real computations. Of course, the if statement alone is going to make these types slower than an optimized math library, but it's pretty convenient to have, and it sure beats throwing an exception or signaling some other kind of error!

Back to reality

So, we're getting pretty close. What does the above program look like in Sh? Well, not much different. Let's use slightly different names and sprinkle on some syntactic sugar in the form of macros, and we get:

( 1) ShAttrib1f b = 0.5;
( 2) ShProgram program = SH_BEGIN_PROGRAM() {
( 3)   ShInputAttrib1f input;
( 4)   ShOutputAttrib1f output;
( 5)   output = input * b;
( 6) } SH_END;
( 7) shCompile(program);
( 8) ShChannel< ShAttrib1f > a(1000), c(1000);
( 9) c = program << a; // run on a thousand floats

Sh defines its syntax in a way that saves you from all the stuff behind the scenes. It also has a full-featured library of standard functions, vector types (like ShAttrib3f and ShAttrib4f) for computing on more than one scalar at once, extensible type support, backend code generators for GPUs and CPUs, and a pretty decent optimizer. On top of all that "standard stuff" it has the ability to combine programs that you've already defined in various ways, and to generate lots of variations of programs at run time.

Let's go through a few things you can do with Sh now that we know how it works. You've already seen an example of a C++ for loop wrapped around embedded code. In Sh that might look as follows:

for (int i = 0; i < 9; i++)
  a = a * b;

And as we've discussed, this will generate an Sh program with ten multiplies, as opposed to an Sh program with a for loop (Sh has for loops too, but they use a slightly different syntax from C++ for loops). The same thing applies to functions:

ShAttrib1f blend(ShAttrib1f amount, ShAttrib1f a, ShAttrib1f b) {
  return amount * a + (1.0 - amount) * b;
}

Because of how Sh is written, that function will generate Sh program code when called inside a program definition, and run immediately when used outside one. So Sh has the ability to define functions. But we got that ability for free, by hijacking it from C++! We can do the same with classes:

class Counter {
public:
  Counter() : m_count(0) {}
  void increment() { m_count++; }
  ShAttrib1i value() const { return m_count; }
private:
  ShAttrib1i m_count;
};

(the "i" in ShAttrib1i stands for integer here). In fact, Sh works with templates. Here's a complete complex number class that runs on GPUs, implemented using Sh:

std::complex< ShAttrib1f >

Pretty cool? You bet. The user gets all of C++ at their disposal, and the Sh developers don't have to implement it. Everybody wins! And the best part is that all the object-oriented structure built with C++ is never seen by the Sh compiler. It just acts as scaffolding, and falls away once the program is defined. This allows the Sh optimizer to be very aggressive, and results in a highly efficient program, where the overhead of things like function calls is paid only once, at definition time, instead of every time the program is executed.

This article provides but a glimpse into what Sh makes possible. For an in-depth overview I suggest the book "Metaprogramming GPUs with Sh" or the best kind of learning tool: hands-on experience! :)