Treviso is just 26 kilometers from Venice, which was at the time the commercial hub of the Mediterranean, and the development of mathematics in medieval Europe is linked to the influences and demands of early capitalism. Trade in the Mediterranean brought new ideas from the Middle East, as well as old ideas which by that time Europe had long forgotten. It also demanded more complex and more frequent calculations, requiring more efficient mathematical tools, such as Arabic numerals.

Indeed, the replacement of Roman numerals, favoured by the “abacists”, by Arabic numerals, favored by the “algorists” (whose name was derived from successive mis-transliterations of al-Khwārizmī), was a matter of some controversy, as featured in this 1508 woodcut, depicting an arithmetic contest.

There’s a lot of interesting material in the Treviso, including some quite intricate divisions done in triple-radix arithmetic, but the thing most relevant to this post is the method of “casting out nines”, a way to quickly check additions and multiplications for errors, which I’d never learned about before.

In the Treviso, casting out nines to check correctness of an addition is described as follows:

Besides this proof [performing subtraction] there is another. If you wish to check the sum by casting out nines, add the units, paying no attention to 9 or 0, but always considering each as nothing. And whenever the sum exceeds 9, subtract 9, and consider the remainder as the sum. Then the number arising from the sum will equal the sum of the numbers arising from the addends.

In more modern language, one might say: add the digits of the summands, reducing \(\mod\; 9\) along the way, then redo the computation (either multiplication or addition) with the new numbers.
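In modern terms the whole check fits in a few lines of Python; this is a sketch (the function names are mine), using the example that follows:

```python
def cast_out_nines(n):
    """Sum the digits of n, reducing mod 9 along the way."""
    total = 0
    while n > 0:
        total = (total + n % 10) % 9
        n //= 10
    return total

def check_addition(a, b, claimed):
    """Necessary (but not sufficient) check: both sides must agree mod 9."""
    return (cast_out_nines(a) + cast_out_nines(b)) % 9 == cast_out_nines(claimed)

print(check_addition(21432, 5836, 27268))  # True: the sum checks out
print(check_addition(21432, 5836, 27265))  # False: error detected
print(check_addition(21432, 5836, 27259))  # True: error missed (off by 9)
```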

For example, if we computed \(21432 + 5836 = 27268\), we can check the answer as follows: on the left-hand side, cast nines out of \(21432\) and \(5836\), and add the results, and on the right-hand side, cast nines out of \(27268\).

On the left-hand side, start with \(21432\), and add the digits \(\mod\;9\): we can cast out \(4,3,2\) since \(4+3+2 = 9\), so \[2+1+4+3+2 \equiv 2+1 \equiv 3 \;\;(\mod\;9).\] For \(5836\), we cast out \(3,6\) since they sum to nine, so \[5+8+3+6 \equiv 5+8 \equiv 4 \;\;(\mod\;9).\] Adding the results gives \(7 = 3+4\).

On the right-hand side, cast nines out of \(27268\) as follows: cast out \(2,7\) since \(2+7=9\), and add the remaining digits, \(2+6+8 = 16\), which is \(7\;\;(\mod\;9)\). So, both sides agree. If, however, we had made a mistake and computed \(27265\) instead of \(27268\), casting out nines would give \(5\), which is not \(7\). However, if we had found \(27259\) instead of \(27268\) by mistake, casting out nines would give \(7\), so casting out nines does not guarantee detection of an error.

Why does this work? First, think about what it means to write a number in base \(10\): let \(n\) be a number and \(c_k,\ldots,c_0\) be its digits. Then \[
n = c_0 + c_1 10 + c_2 100 + \cdots + c_k 10^k.
\] Now, consider the equation \[
9 = 10 - 1.
\] This tells us that \(10 \equiv 1 \;\;(\mod\; 9)\). Reducing the previous equation \(\mod\; 9\), we get \[
\begin{array}{rcl}
n &=& c_0 + c_1 10 + c_2 100 + \cdots + c_k 10^k \\
&\equiv& c_0 + c_1 1 + c_2 1^2 + \cdots + c_k 1^k \;\;(\mod\;9)\\
&\equiv& c_0 + c_1 + c_2 + \cdots + c_k \;\;(\mod\;9). \\
\end{array}
\] So, to reduce a number \(\mod\; 9\), we can just compute the digit sum \((\mod\; 9)\). This makes computing the reduction of a base-\(10\) number \(\mod\; 9\) ludicrously efficient to do by hand. (As an aside, a more elegant proof of this statement can be found on page 68 of Carl E. Linderholm’s 1972 classic, *Mathematics Made Difficult*.)

Casting out nines, therefore, amounts to reduction \(\mod\; 9\), and redoing the computation in \(\mathbb{Z}/9\mathbb{Z}\). Since the reduction is a ring homomorphism, it preserves addition and multiplication, so if we got a different answer after casting out nines, there must have been a mistake. However, if we compute a wrong answer that differs from the correct one by a multiple of \(9\), then the error will not be detected: in the example above, we have \[ 27259 = 21432 + 5836 - 9 \equiv 21432 + 5836 \;\;(\mod\; 9), \] and assuming that all errors are equally likely, this happens with probability \(1/9\), so we detect an error with probability \(8/9\), or about \(89\%\).

To increase the error-detection probability, we can generalize the method, by reducing \(\mod\; 99\), \(\mod\; 999\), etc. For instance, to check \[ 197702369162 = 842987 \times 234526 \] by casting out 99, we group digits into pairs and take sums: \[ \begin{array}{rcl} (19+77)+(2+36)+(91+62) &=& 96 + 38 + (100 + 53) \\ &\equiv& (96 + 38) + 1 + 53 \;\;(\mod\; 99) \\ &\equiv& (1 + 34) + 1 + 53 \;\;(\mod\; 99) \\ &\equiv& 89 \;\;(\mod\; 99); \end{array} \] we have \(84+29+87=200 \equiv 2 \;\;(\mod\; 99)\), and \(23+45+26 = 94\), so the product is \(2\times 94 = 188 \equiv 89\), and the check passes; under the same assumption as before, an error would be detected with probability \(98/99\), just under \(99\%\).
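The same check can be scripted; this sketch (names mine) treats pairs of digits as base-\(100\) digits:

```python
def mod_99(n):
    """Reduce n mod 99 by summing two-digit groups (base-100 'digits')."""
    total = 0
    while n > 0:
        total = (total + n % 100) % 99
        n //= 100
    return total

lhs = mod_99(197702369162)                      # pair the digits and sum
rhs = (mod_99(842987) * mod_99(234526)) % 99    # redo the product mod 99
print(lhs, rhs)  # 89 89
```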

The point here is that, given a number in base \(10\), it’s easy to convert to a number in base \(10^k\) by grouping digits, and given a number in base \(10^k\), it’s easy to reduce \(\mod\; m = 10^k -1\), since \[ 10^k \equiv 1 \;\;(\mod\; m). \] And, in fact, although the \(1\) on the right-hand side is a particularly easy number to deal with, things are still easy if we pick \(m\) to be close to \(10^k\), so that the right-hand side becomes a small number. In this case we don’t quite get to take a digit sum, but it’s still fairly simple. For example, setting \(p = 997 = 10^3 - 3\), we have \(1000 \equiv 3 \;\;(\mod\; p)\), so to reduce \(987459324\;\;(\mod\; p)\), we take \[ \begin{array}{rcl} 987(1000^2) + 459(1000) + 324 &\equiv& 987(9) + 459(3) + 324 \;\;(\mod\; p) \\ &\equiv& 8883 + 1377 + 324 \;\;(\mod\; p) \\ &\equiv& 8(3) + 883 + 1(3) + 377 + 324 \;\;(\mod\; p) \\ &\equiv& 1611 \equiv 614 \;\;(\mod\; p). \end{array} \]
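As a sketch of this folding process in code (the function name and loop structure are mine), we repeatedly replace the high base-\(10^k\) digits by \(c\) times their value:

```python
def mod_near_power(n, k, c):
    """Reduce n mod (10**k - c), using the fact that 10**k ≡ c."""
    m = 10**k - c
    while n >= 10**k:
        # split n = hi * 10**k + lo, and use 10**k ≡ c (mod m)
        n = (n // 10**k) * c + (n % 10**k)
    return n % m   # final reduction for n in [m, 10**k)

print(mod_near_power(987459324, 3, 3))  # 614, i.e. 987459324 mod 997
```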

Putting all this together, we see that it’s easy to reduce a number written in base \(10\) by any number which is close to a power of \(10\). But there’s nothing really special about base \(10\) – for exactly the same reasons, it’s easy to reduce a number written in base \(b\) modulo a number close to a power of \(b\).

For computers, which use binary, this means that reduction modulo numbers close to powers of \(2\) is very fast. In particular, if one seeks to do arithmetic \(\mod\; p\) on a computer, it’s good to choose a prime \(p\) which is close to a power of \(2\).
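As a rough sketch of the binary version (names mine), reduction modulo \(2^k - c\) takes only shifts, masks, and small multiplications:

```python
def mod_near_pow2(n, k, c):
    """Reduce n mod (2**k - c) with shifts and masks, using 2**k ≡ c."""
    m = (1 << k) - c
    mask = (1 << k) - 1
    while n >> k:                  # fold the high bits down until n < 2**k
        n = (n >> k) * c + (n & mask)
    return n % m                   # final reduction for n in [m, 2**k)

m = (1 << 61) - 1                  # a Mersenne prime: k = 61, c = 1
x = 2**200 + 987654321
print(mod_near_pow2(x, 61, 1) == x % m)  # True
```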

This (partially) explains the “25519” in “Curve25519”, one of the best choices of curve for elliptic-curve cryptography: the curve is defined modulo \(p =2^{255} - 19\), and, as discussed on page 13 of the paper, the prime \(p\) was chosen among primes near \(256\) bits as the one closest to a power of \(2\) (beating out \(2^{255} +95\), \(2^{255} - 31\), \(2^{254} + 79\), \(2^{253} + 51\), and \(2^{253} + 39\)).

*Thanks to Peter Schwabe for lending me his copy of Capitalism & Arithmetic*.

Last time, I talked a little bit about an implementation of a trie to store frequency information about n-grams. The problem is that the naïve implementation of a trie, while much more compact than the source data, is still not small enough.

One approach to dealing with this problem is to do evil bit-twiddling hacks to reduce the size, but ultimately, this just gives you worse code and little real benefit.

In information theory, the *entropy*, also called *Shannon entropy* after Claude Shannon, is a measure of the information content of a random variable.

Information theory is pretty interesting, but for our purposes we’ll just consider two facts. First, if we have a random variable taking values in an alphabet of size \(n\), then the entropy is maximized by the uniform distribution, where it equals \(\lg n\) bits. Second, Shannon’s source coding theorem tells us that, asymptotically, the best possible compression rate is the entropy.
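As a quick illustration of the first fact (a sketch; `entropy` is just the textbook definition):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p*lg(p), with 0*lg(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/8] * 8))                   # 3.0 bits = lg 8, the maximum
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits, less than lg 4 = 2
```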

Now, let’s think about the space complexity of a naïve tree data structure, where we number all of the vertices, and store for each vertex a list of its children. For \(n\) vertices, this representation takes \(\Omega(n\log n)\) bits. But suppose that our vertices are unlabeled. Let \(r(n)\) be the number of unlabeled trees on \(n\) vertices with a specified root vertex (OEIS:A000081). Then, as \(n \rightarrow \infty\), \[
r(n) \sim D \alpha^n / n^{3/2},
\] where \(D = 0.439\ldots\) and \(\alpha = 2.95\ldots\) are constants, so if we simply numbered these trees and used the number to identify the tree, we could use only \[
\log_2 r(n) = n \log_2 \alpha - \Theta(\log n) \approx 1.56n
\] bits. Obviously, this isn’t a practical data structure (how can you perform tree operations on a number?), but the point is that the most compact representation is actually linear in size. There’s a big gap between linear and \(n\log n\), so it’s an interesting question to ask how we could have practical succinct trees. For more on this, take a look at the paper *Succinct Trees in Practice*.
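As a quick sanity check on \(r(n)\), the exact counts can be generated from the classical recurrence listed in the OEIS entry (this is a throwaway sketch, not a practical data structure):

```python
def rooted_trees(N):
    """a[n] = number of unlabeled rooted trees on n vertices (OEIS A000081),
    via a(n+1) = (1/n) * sum_{k=1..n} (sum_{d|k} d*a(d)) * a(n-k+1)."""
    a = [0] * (N + 1)
    a[1] = 1
    for n in range(1, N):
        s = 0
        for k in range(1, n + 1):
            b = sum(d * a[d] for d in range(1, k + 1) if k % d == 0)
            s += b * a[n - k + 1]
        a[n + 1] = s // n   # the sum is always divisible by n
    return a

print(rooted_trees(10)[1:])  # [1, 1, 2, 4, 9, 20, 48, 115, 286, 719]
```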

The point, of course, isn’t that we should expect to get all the way down to sub-linear space complexity (in fact, if we want to label nodes with up to \(\sigma\) different labels, we need an additional \(O(n\log \sigma)\) bits for that), but just that we shouldn’t be surprised if we can do much better than a naïve approach.

It turns out that this problem has been thought about before: for instance, there’s a 2009 paper by Germann, Joanis, and Larkin, called *Tightly Packed Tries: How to Fit Large Models in Memory, and Make them Load Fast, Too*. I implemented their data structure, which encodes the nodes in depth-first order as follows. For each node, write its frequency in LEB128, then write the frequency of its children, also in base-128. If it is not a leaf node, then we have already written all of its children, since the encoding is depth-first, so we can compute the byte offset of the child node from the current node. Finally, we write a list of (key, offset) pairs, with an index size in front so we know when the index ends.

This is maybe a little confusing, but there’s an annotated version of the binary format for a simple example here that makes it more clear.

The TPT data structure is basically the same as the original trie structure: you have nodes, which store some frequency counts, and a list of child nodes. But it’s much more efficient, for two reasons.

The first is the use of a variable-length encoding for integers. Zipf’s Law is the observation that much data in linguistics and other sciences follows a power-law distribution. From Wikipedia: “Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.”

The LEB128 code has a slight overhead: 1 bit per byte. But the vast majority of the frequency counts will be small, with only a few large numbers, so it’s much more efficient.
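For concreteness, here’s a sketch of unsigned LEB128 (the variable-length encoding mentioned above); `leb128_decode` assumes its input is exactly one encoded value:

```python
def leb128_encode(n):
    """Unsigned LEB128: 7 payload bits per byte, high bit = 'more follows'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def leb128_decode(data):
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
    return n

print(leb128_encode(3).hex())       # '03' -- small counts cost one byte
print(leb128_encode(624485).hex())  # 'e58e26' -- the classic 3-byte example
print(leb128_decode(leb128_encode(624485)))  # 624485
```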

The second reason is that instead of storing pointers to the child nodes, we store relative byte offsets, which are LEB128-encoded. By using relative addressing instead of absolute addressing, the numbers tend to be smaller, since we usually write child nodes close to their parents in the file. Smaller numbers mean fewer bytes written. Moreover, relative offsets mean that we don’t need to put things in physical memory: to deal with large data sets, just `mmap` and call it a day.

In the toy example I linked, we write the whole trie in 53 bytes, instead of 280, so it’s more than five times smaller. For a bigger data set, like the English 1-million 2-grams, writing the trie this way takes 710 MB, compared to about 7 GB with the previous method (the original data set is ~80 GB).

Recently, I’ve been playing with some of the data. In the past, I’ve played with randomly generating text with a Markov chain: you have some source text corpus, and you generate random text by picking the next word according to the probability distribution of the source text, conditioned on the previous \(k\) words.

One of the available n-gram data sets contains the million most frequent English words, and I thought it’d be fun to try to run some kind of Markov-chain algorithm using this as the source corpus.

Since there’s a lot of data (the 2-gram dataset is 80 GB uncompressed, and the 3-gram dataset is 280 GB uncompressed), we want a compact data structure to hold it. The hope is that with enough magic, the needed data can fit into 16 GB of RAM without swapping. Tries seem like a good fit for the problem, since all of the prefixes are shared between n-grams. In a trie, each node stores a map of keys to child nodes, but the keys themselves are not stored explicitly: the key for a node is given by the path from the root. Since we want to do some things with probabilities, each node has a frequency count, as well as the sum of the frequencies of all child nodes.
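As a toy sketch of this structure (not the actual implementation), with a count per node plus a running total over children for conditional probabilities:

```python
class TrieNode:
    """Toy n-gram trie: keys live on edges, each node stores a count
    plus the total frequency inserted through its children."""
    def __init__(self):
        self.children = {}
        self.count = 0
        self.children_total = 0

    def insert(self, words, count):
        node = self
        for w in words:
            node.children_total += count
            node = node.children.setdefault(w, TrieNode())
        node.count += count

root = TrieNode()
root.insert(("the", "cat"), 10)
root.insert(("the", "dog"), 30)
the = root.children["the"]
print(the.children["dog"].count / the.children_total)  # P(dog | the) = 0.75
```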

But there’s a lot of data, so we need to think about how to implement this in a space-efficient way. I don’t care about portability to non-x64 systems (for *this* program), so we can do some bit-twiddling: x64 only uses the lower 48 bits of a pointer to address memory, while the upper 16 bits are either all \(1\)s or all \(0\)s.

This means, for instance, that you can have an implementation of (sufficiently small) vectors where the bookkeeping information is stored in the array pointer itself. To save memory, the node children are saved in a tagged-pointer data structure which is a pointer (tagged with information about array size) to an array of tagged pointers (tagged with the character used as a key for each node). The array is kept sorted, so we can do lookups fairly quickly with a binary search, and the whole thing takes only 8 bytes for the pointer to the array, plus 8 bytes times the array size (rounded up to the closest power of 2). Rounding to the closest power of 2 ends up wasting very little space, since the distribution of the number of children seems to follow a power law anyways.
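The bit layout can be illustrated with Python integers standing in for 64-bit words (a sketch; real code must also re-sign-extend bit 47 of the address before dereferencing it):

```python
ADDR_MASK = (1 << 48) - 1   # x64 uses only the low 48 bits for addressing

def tag_pointer(addr, tag):
    """Stash a 16-bit tag (e.g. an array size or key) in the upper bits."""
    assert 0 <= tag < (1 << 16)
    return (tag << 48) | (addr & ADDR_MASK)

def untag_pointer(word):
    """Recover (address, tag); a real implementation must sign-extend addr."""
    return word & ADDR_MASK, word >> 48

word = tag_pointer(0x00007F00DEADBEEF, 5)
print(hex(word))                               # 0x57f00deadbeef
print(untag_pointer(word) == (0x7F00DEADBEEF, 5))  # True
```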

The next problem is that, having compacted the size of all of the data we need to store, we’re left with having to do a ton of small allocations of memory blocks that are nearly pointer-sized. Not only is this inefficient in terms of speed, it’s also space-inefficient: the memory allocator has to keep track of what it’s allocated, which takes a minimum of 8 bytes. This is fairly easily solved by performing allocations out of memory pools of fixed size.

In the end, we can build a trie for the 2-gram dataset in about 4 GB of memory, and we can almost load the 3-gram dataset. But much better results are possible. The code I have now isn’t a great solution, since even after applying a whole number of too-clever hacks, it still doesn’t work, because the basic data structure isn’t a great fit. Having to keep pointers to different blocks of memory isn’t a good idea, since not only are the pointers really large (8 bytes), but they also preclude easy serialization, so large datasets can’t just be `mmap`’d. Not only can you not take advantage of the kernel’s paging system, you also have to reload the source data every time.

More importantly, the approach of trying to make the trie smaller is really focusing on the wrong question: “how do you use bit-twiddling hacks to shrink the implementation of this data structure?” rather than “what kind of data structure uses the (information-theoretic) properties of the data to compress it, in such a way that we can perform computation on the compressed data?”. Answering the first question instead of the second leads to unreadable code that’s too clever by half, but doesn’t really get you where you need to go. Answering the second question leads to really interesting things like succinct data structures and is generally the better way to go.

Indeed, it should be possible to do much better: this paper, for instance, claims a data structure that compresses the entire terabyte-sized n-gram dataset (\(n = 1,2,3,4,5\)) into 16 GB.

(Addendum: “part 1” indicates hopefully more posts on the topic, but I don’t have anything written yet. Also, as it’s not directly-KDE-related, just (hopefully) interesting, if you’re reading this on planetKDE and don’t want to see it, complain loudly and I’ll leave future posts out of syndication…)

However, when you have a KML file with `ExtendedData` tags like this:

```
<ExtendedData>
<SchemaData schemaUrl="#FED_35_final">
<SimpleData name="FEDNUM">35032</SimpleData>
<SimpleData name="ED_NAMEE">Guelph</SimpleData>
<SimpleData name="ED_NAMEF">Guelph</SimpleData>
</SchemaData>
</ExtendedData>
```

and you try to convert it to a shapefile, it may not work. The reason is that you need to have GDAL built with LibKML support. If you do, you’ll see something like this in the output of `ogrinfo --formats`:

```
-> "LIBKML" (read/write)
-> "KML" (read/write)
```

If not, you’ll just see the entry for “KML”. Note that here “KML” refers to GDAL’s builtin KML support, which should always be there, but does not support extended data attributes. The Arch packages for GDAL don’t have LibKML support enabled, and it turns out that GDAL needs a more recent version of LibKML. So I adapted an existing AUR package to make `libkml-git` and put it on the AUR. To get GDAL to use it, simply build one of the GDAL packages from the AUR and edit the `PKGBUILD` to add `--with-libkml` to the configure flags and `libkml` to the `depends` field.

Hopefully that’s helpful to someone down the road.

Running `clang -cc1 -fdump-record-layouts ppfile.cpp` on a preprocessed C++ file, produced using, e.g., `clang -E -I/probably/lots/of/include/paths file.cpp`, gives output like:

```
*** Dumping AST Record Layout
0 | class StarObject
0 | class SkyObject (primary base)
0 | class SkyPoint (primary base)
0 | (SkyPoint vtable pointer)
0 | (SkyPoint vftable pointer)
16 | long double lastPrecessJD
32 | class dms RA0
32 | double D
| [sizeof=8, dsize=8, align=8
| nvsize=8, nvalign=8]
...(snipped)...
184 | float B
188 | float V
| [sizeof=192, dsize=192, align=16
| nvsize=192, nvalign=16]
```

Notice that the `lastPrecessJD` variable is stored as a `long double`, with 64 bits of significand precision instead of the usual 53 bits given by a `double`. In practice, `long double` has 16-byte storage and alignment. Since the vtable pointer takes up only 8 bytes (on 64-bit), we waste 8 bytes on padding. Moreover, we then take up 16 bytes to store `lastPrecessJD`, but using a program like the following:

```
#include <stdio.h>
#include <math.h>
int main()
{
double jd2000 = 2451545.0;
double delta = nextafter(jd2000,jd2000+1) - jd2000;
printf("delta: %.30f\n", delta);
return 0;
}
```

we can compute that at the year 2000, the minimum time step at (64-bit) double precision is approximately *40 microseconds*, so it’s not clear that we gain anything by using 80-bit long doubles instead of 64-bit doubles. Changing the `long double` to `double` (and placing it last, though this isn’t strictly necessary) results in memory layout for the SkyPoint class like so:

```
*** Dumping AST Record Layout
0 | class SkyPoint
0 | (SkyPoint vtable pointer)
0 | (SkyPoint vftable pointer)
8 | class dms RA0
8 | double D
| [sizeof=8, dsize=8, align=8
| nvsize=8, nvalign=8]
...(snipped)...
48 | class dms Az
48 | double D
| [sizeof=8, dsize=8, align=8
| nvsize=8, nvalign=8]
56 | double lastPrecessJD
| [sizeof=64, dsize=64, align=8
| nvsize=64, nvalign=8]
```

This saves 16 bytes, cutting the size to 64 bytes from 80^1. Since KStars suffers from abuse of complex inheritance hierarchies and everything-is-an-object, this is 16 bytes saved for every single object in the sky.

Doing some simple rearrangements of the data in other classes means we can also save 8 bytes per StarObject and DeepSkyObject. Overall, these changes give approximately a **10% reduction in memory usage**, just from removing padding.

1. This also has the benefit that the SkyPoint data fits in a single cache line, though I don’t think this really makes a difference given the inefficiencies in the rest of the code, and the fact that none of our data has any thought put into alignment, but it’s nice to have.

Note (2013-12-28): This information is out of date. In any case, you should be using the Arch wiki as a reference, as it’s kept up-to-date; this is just for my own memory.

**Note (2013-12-28): I would not recommend using bcache in combination with btrfs. Filesystem corruption may result due to unknown interactions between bcache and btrfs. There are some posts on various mailing lists about the issue.**

Follow the instructions on the Arch wiki: install the `bcache-tools-git` package from the AUR. Once you have the partitions you want to use as the cache and the backing store, run `make-bcache` to create the bcache device, which appears as `/dev/bcache0`.

Now, instead of just creating a btrfs filesystem on the new `/dev/bcache0` device, we follow the instructions on this page. So we end up with the following subvolumes:

```
hdevalence@noether /> sudo btrfs subvolume list -a .
ID 256 gen 12022 top level 5 path <FS_TREE>/__active
ID 257 gen 12146 top level 5 path <FS_TREE>/__active/home
ID 258 gen 12142 top level 5 path <FS_TREE>/__active/var
ID 259 gen 11893 top level 5 path <FS_TREE>/__active/usr
```

The mount options in `/etc/fstab` are:

`rw,noatime,ssd,discard,space_cache,compress=lzo,subvol=__active`

Next, we go back to the bcache instructions, to set up `mkinitcpio` to generate a kernel image that can pick up the bcache device. Picking the udev option, we copy the udev script into `/usr/lib/initcpio/install/bcache_udev`.

Then add `bcache` to `MODULES`, and edit `HOOKS` to look like

`HOOKS="base udev autodetect modconf block bcache_udev filesystems keyboard fsck btrfs_advanced"`

It’s important not to forget to add `bcache` to `MODULES`, or else the system won’t boot.

Finally, by default bcache uses writethrough caching. I don’t think that my SSD is too unreliable, and I have backups, so I do

```
[root@noether ~]# echo writeback > /sys/block/bcache0/bcache/cache_mode
[root@noether ~]# cat /sys/block/bcache0/bcache/cache_mode
writethrough [writeback] writearound none
```

Note that the documentation on the bcache site was slightly incorrect, at least when I was looking at it. It says to run

`# echo writeback > /sys/block/bcache0/cache_mode`

but since that’s not the correct path, it just gives

`/sys/block/bcache0/cache_mode: No such file or directory`

Next, we install X11 and video drivers. I picked the `radeon` drivers instead of `fglrx`, since `radeon` now supports the Southern Islands chipsets, and AMD’s proprietary drivers are utter crap. The packages to install are

then start X and run `glxgears`, `glxinfo`, etc. to check that it’s working. Finally, enable dynamic power management by editing the kernel parameters to add `radeon.dpm=1`.

Finally, install KDE, etc. as normal.

To make these, we generated some circular patterns and wrote out GCode that the laser at the Hacklab can read, like so:

```
import Definitions
import PolylineFormats

-- makes a circle of radius r, in millimeters
circle :: Float -> Polyline
circle r = [(r * cos (i*2*pi/nPts), r * sin (i*2*pi/nPts)) | i <- [0..nPts]]
  where nPts = 128

main :: IO ()
main = do
  let writeCircles filename rs =
        writeFile filename $ hacklabLaserGCode $ map circle rs
  writeCircles "outOdd.ngc"  [105,115..225]
  writeCircles "outEven.ngc" [100,110..220]
```

Here we’re using some of the libraries from ImplicitCAD, but with the imports relativized, since I don’t have a working Cabal installation on my Chromebook. When we score the paper with the laser, it makes a mountain fold, and since we’re trying to pleat the paper, we need to do one pattern on each side. So we generate a set of odd-numbered rings and a set of even-numbered rings. It would also be interesting to try experiments with unevenly spaced rings (here they’re all 5mm apart) or moving the rings so that they’re offset instead of perfectly concentric.

Since we’re cutting on both sides, we need to flip the paper while keeping it in the exact same place. Our solution was to score the odd side (with the outer ring) first, and increase the laser power to cutting strength for the last ring. Then, we hold the rest of the sheet in place and flip the disc over into the same hole.

Folding the scored pattern is fairly time-consuming, but not especially difficult. It’s best to work inward from the outside, folding one pleat at a time. Folds along curved lines cause bending in the paper and vice versa, so gently bending the paper as you fold it can help the pleats to pop into the right shape. If you’re curious about the shape, there’s a bit more detail in my post from last year.

So, the point is that although we want users to be able to use OpenCL if they have one of these implementations installed, we can’t rely on it. The solution I arrived at is to have two classes, `KSBuffer` and `KSContext`, which respectively hold a buffer of points to do computation on, and manage contextual state for the computation.

These classes use d-pointers, and for each of them we have two classes that inherit from the *Private classes. One of these uses OpenCL, while the other just uses plain Eigen on the CPU. This way, the rest of the code that wants to do computation on buffers of points doesn’t need to know anything about the implementation details, and we can make OpenCL an optional dependency both at compile- and run-time.

We can also run a short test to see how the performance has changed.

As a small test, we create a buffer of 1 million sky points, and then do the steps needed to compute the apparent position of these points at a given time:

- Precession
- Nutation
- Aberration
- Conversion to horizontal coordinates (i.e., coordinates for a given location and time).

Running these steps, we get:

- Old algorithms: **3947ms**
- New algorithms (with Eigen): **70ms (56x baseline)**
- New algorithms (with OpenCL): **30ms (132x baseline)**

So, this is a pretty good result so far, with the following caveats:

- None of the new code is optimized.
- The benchmark is pretty synthetic, and we usually don’t process a million points at once.

I’m looking forward to seeing how much benefit we can get once we integrate the new algorithms into the sky-component hierarchy, and whether we can optimize this further.

Addendum: since this got posted to Phoronix, it’s good to point out that the dramatic improvement is actually from better algorithms, **not** from using OpenCL. For information, see some of my previous posts on the algorithmic changes.

Consider an observer on Earth, looking at some star. Relative to the Earth, the observer is stationary, but relative to the solar system the observer is moving at the same speed as the Earth. The Earth is moving quite quickly around the sun, and this speed is large enough that we can see the effects of relativity: the angle of the incoming beam of light changes, because the speed of light is constant. (Imagine how rain appears to fall diagonally when travelling in a car; but here, instead of just adding the velocities, we use relativity, since the speed of light is constant.)

The consequence is that points in the sky appear to move in ellipses as the year passes. Points near the ecliptic plane will travel in flattened ellipses, moving back and forth in a line, while points at the poles will travel in circles.

In the existing implementation, every time we want to compute the position of a point, we use four parameters to estimate the effect of aberration, and then do a lengthy calculation along these lines:

```
double dRA = -1.0 * K * ( cosRA * cosL * cosOb + sinRA * sinL )/cosDec
+ e * K * ( cosRA * cosP * cosOb + sinRA * sinP )/cosDec;
double dDec = -1.0 * K * ( cosL * cosOb * ( tanOb * cosDec - sinRA * sinDec ) + cosRA * sinDec * sinL )
+ e * K * ( cosP * cosOb * ( tanOb * cosDec - sinRA * sinDec ) + cosRA * sinDec * sinP );
```

This requires a ton of trigonometry calls, it’s difficult to understand what all the terms are, and there’s no work being shared between calculations.

The new algorithm uses a stereographic projection to do a lot of the work. If you don’t know what a stereographic projection is, it’s very simple: to project a point from the sphere onto the plane, you simply draw the line that passes through that point and the north pole, and look at where that line intersects the plane. The only snag is that the north pole itself gets sent to infinity, and though this isn’t a big problem mathematically (we can work in projective space) it’s not good for computations, so we want to avoid doing that. Wikipedia has more information.

My new implementation is roughly along the lines of this paper, except for some minor details. Geometrically, the effect of aberration is to shift the position of a point towards the direction of motion, so that \[ \tan \frac{\theta'}{2} = \sqrt{\frac{c-v}{c+v}}\tan\frac{\theta}{2}, \] where \(\theta\) and \(\theta'\) are respectively the true and apparent angles between the point and the direction of motion. (Note that MathJax isn’t rendered on PlanetKDE or RSS, so the formulas may not display properly there.) It turns out that if the direction of motion is aligned with the south pole, then after doing a stereographic projection, this effect is just scaling by \[ \sqrt{\frac{c-v}{c+v}}. \]
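As a numerical sanity check (function names mine), this half-angle form agrees with the textbook relativistic aberration formula \(\cos\theta' = (\cos\theta + \beta)/(1 + \beta\cos\theta)\), where \(\beta = v/c\):

```python
import math

beta = 1e-4   # v/c; Earth's orbital speed is about 30 km/s, so beta ~ 1e-4

def aberrate_half_angle(theta):
    """tan(theta'/2) = sqrt((1 - beta)/(1 + beta)) * tan(theta/2)."""
    k = math.sqrt((1 - beta) / (1 + beta))
    return 2 * math.atan(k * math.tan(theta / 2))

def aberrate_cosine(theta):
    """cos(theta') = (cos(theta) + beta) / (1 + beta * cos(theta))."""
    return math.acos((math.cos(theta) + beta) / (1 + beta * math.cos(theta)))

# the two forms agree to double precision for angles in (0, pi)
for theta in (0.3, 1.0, 2.0, 3.0):
    assert abs(aberrate_half_angle(theta) - aberrate_cosine(theta)) < 1e-12
```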

Thus we can compute aberration by rotating our coordinate system to align it with the Earth’s motion, projecting, scaling, and deprojecting. Moreover, KStars already has an implementation of a very accurate method for computing the motion of the Earth at any point in time, so this is even more accurate, because we’re doing the full relativistic calculation instead of just using a special-case method.

What’s more, we don’t need to do any trigonometry at all, just a few simple multiplications and divisions. And unlike the old method, we can share work between projections, computing the velocity and scaling factor just once.

The only worry is that we want to avoid the singularity, but this too is not a big deal, since we can project through the *south* pole and multiply by the reciprocal. Moreover, since we’re branching based on the location of the point, if we order our data to have spatial locality, then the branch is predictable and basically goes away. (I’m saving more details on performance changes in KStars for another post).
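Here’s a sketch of the project/scale/deproject pipeline (assuming the direction of motion is at the south pole \((0,0,-1)\); all names are mine). A point at angle \(\theta\) from the south pole projects to radius \(\tan(\theta/2)\), so scaling the plane realizes the formula above:

```python
import math

def project(p):
    """Stereographic projection from the north pole (0, 0, 1) onto z = 0."""
    x, y, z = p
    return (x / (1 - z), y / (1 - z))

def deproject(q):
    """Inverse of project: plane point back onto the unit sphere."""
    X, Y = q
    s = X * X + Y * Y
    return (2 * X / (s + 1), 2 * Y / (s + 1), (s - 1) / (s + 1))

def aberrate(p, beta):
    """Aberration toward the south pole is a pure scaling in the plane."""
    k = math.sqrt((1 - beta) / (1 + beta))
    X, Y = project(p)
    return deproject((k * X, k * Y))

beta = 1e-4                                   # roughly Earth's v/c
theta = 1.0                                   # angle from the south pole
p = (math.sin(theta), 0.0, -math.cos(theta))  # unit vector at that angle
x, y, z = aberrate(p, beta)
theta_new = math.acos(-z)                     # new angle from the south pole
k = math.sqrt((1 - beta) / (1 + beta))
assert abs(math.tan(theta_new / 2) - k * math.tan(theta / 2)) < 1e-12
```

Since \(k < 1\), the point is pulled slightly toward the south pole, i.e. toward the direction of motion.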

Before, we were using spherical coordinates, and the expression to get the change caused by aberration was really awkward. But here, we’re using a coordinate system which fits the problem, and describing the change is simple: it’s just a scaling factor!

It’s a really nice example of how *choosing the right language* to use to describe the problem you’re having lets you get a much better solution.

In the old code, everything was done with spherical trigonometry, which means that doing any calculation requires many calls to `sin`, `cos`, `tan` and friends, and also causes a lot of problems because all of the coordinate systems have singularities at the north and south poles. The biggest disadvantage, however, is that it’s impossible to separate calculating *what the transformation is* from *actually applying the transformation*.

For example, consider the old implementation of nutation:

```
void SkyPoint::nutate(const KSNumbers *num) {
    double cosRA, sinRA, cosDec, sinDec, tanDec;
    double cosOb, sinOb;
    RA.SinCos( sinRA, cosRA );
    Dec.SinCos( sinDec, cosDec );
    num->obliquity()->SinCos( sinOb, cosOb );
    //Step 2: Nutation
    if ( fabs( Dec.Degrees() ) < 80.0 ) { //approximate method
        tanDec = sinDec/cosDec;
        double dRA  = num->dEcLong()*( cosOb + sinOb*sinRA*tanDec ) - num->dObliq()*cosRA*tanDec;
        double dDec = num->dEcLong()*( sinOb*cosRA ) + num->dObliq()*sinRA;
        RA.setD( RA.Degrees() + dRA );
        Dec.setD( Dec.Degrees() + dDec );
    } else { //exact method
        dms EcLong, EcLat;
        findEcliptic( num->obliquity(), EcLong, EcLat );
        //Add dEcLong to the Ecliptic Longitude
        dms newLong( EcLong.Degrees() + num->dEcLong() );
        setFromEcliptic( num->obliquity(), newLong, EcLat );
    }
}
```

There are a few things to notice here:

- Because of the overuse of trig functions, we introduce a lot of useless variables just so that we can use some GNU extension to compute `sin` and `cos` at the same time for a slight speed boost, costing us in readability.
- Again because of speed considerations, we need to use an approximate method instead of an exact method when we can get away with it.
- This method doesn’t actually take the date we want to “nutate to” as a parameter. Instead, you have to pass in a `KSNumbers` class, which is basically a huge mess of unrelated variables that depend on time and are used for various computations.

Here is the new implementation:

```
namespace Convert {
...
    CoordConversion Nutate(const JulianDate jd)
    {
        double dEcLong, dObliq;
        AstroVars::nutationVars(jd, &dEcLong, &dObliq);
        //Add dEcLong to the Ecliptic Longitude
        CoordConversion rot = AngleAxisd(dEcLong*DEG2RAD, Vector3d::UnitY()).matrix();
        return EclToEq(jd) * rot * EqToEcl(jd);
    }
...
}
```

which gives a matrix that can be used as follows:

```
JulianDate jd = ...;
EquatorialCoord point = ...;
EquatorialCoord nutated = Convert::Nutate(jd) * point;
```

Notice:

- When we want to compute the nutation for a particular date, we just give *that date*, and not a huge bundle of numbers. The particular numbers that we need are in their own function, `nutationVars`.
- It’s roughly the same amount of work to compute `Convert::nutationVars` as to compute one nutated point with the exact method. But then for every point, we just have to multiply a vector by a matrix – 9 multiplications and 6 additions – instead of having to redo all the work.
- Because the matrix is computed only once, there’s no additional cost to using the exact method instead of the approximation.
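The same idea can be sketched with NumPy (the axis conventions and the numbers here are illustrative, not KStars’s actual ones): build the matrix once, then transform a whole batch of points with a single matrix multiply.

```python
import numpy as np

DEG2RAD = np.pi / 180.0

def rot_axis(axis, angle):
    """Rotation matrix about a coordinate axis (0 = x, 1 = y, 2 = z)."""
    c, s = np.cos(angle), np.sin(angle)
    m = np.eye(3)
    i, j = [(1, 2), (2, 0), (0, 1)][axis]
    m[i, i] = c
    m[j, j] = c
    m[i, j] = -s
    m[j, i] = s
    return m

def nutate_matrix(d_ec_long_deg, obliquity_deg):
    """Rotate into ecliptic coordinates, add d_ec_long to the longitude,
    rotate back -- mirroring EclToEq * rot * EqToEcl above."""
    ecl_to_eq = rot_axis(0, obliquity_deg * DEG2RAD)
    rot = rot_axis(1, d_ec_long_deg * DEG2RAD)
    return ecl_to_eq @ rot @ ecl_to_eq.T   # .T inverts an orthogonal matrix

# dEcLong of ~17 arcseconds is a typical size for nutation in longitude.
M = nutate_matrix(17.2 / 3600.0, 23.44)
points = np.random.default_rng(0).normal(size=(1000, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)
nutated = points @ M.T                     # one matmul nutates every point
assert np.allclose(np.linalg.norm(nutated, axis=1), 1.0)
```

Because the rotation is orthogonal, lengths are preserved, which is a handy cheap correctness check for batch transforms like this.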

So, by doing things in a more elegant way, we gain readability, cut the size of the codebase, improve speed, and allow for further optimization by batch processing.

So far, I’ve done this for all of the coordinate conversions, except for the computation of aberration, which has a new implementation but is still trig-based, since unlike the other computations, aberration is not an orthogonal map. It should be possible to optimize it more, though. Also, all of the new code has unit tests, so that we can check whether or not it is correct, unlike the old code which has neither tests nor a coherent idea of what “correct behaviour” might be.

The next step is to set up code that stores and works with arrays of objects, so that we can actually do batch processing.
