# Planet Collabora

## May 29, 2020

### Sjoerd Simons

#### Debian armhf VM on arm64 server

htmltidy failed to parse this html

#### Debian Jessie on Raspberry Pi 2

htmltidy failed to parse this html

#### Hacker Cat

htmltidy failed to parse this html

#### Meego Tablet now with cat support

htmltidy failed to parse this html

#### Codec bitrate control

htmltidy failed to parse this html

#### Just one of those days

htmltidy failed to parse this html

#### pkg-pulseaudio needs you!

htmltidy failed to parse this html

htmltidy failed to parse this html

#### Telepathy and VP8!

htmltidy failed to parse this html

#### First post!

htmltidy failed to parse this html

## April 23, 2020

### Alyssa Rosenzweig

#### From Bifrost to Panfrost - deep dive into the first render

In Panfrost’s infancy, community members Connor Abbott and Lyude Paul largely reverse-engineered Bifrost and built a proof-of-concept shader dis/assembler. Meanwhile, I focused on the Midgard architecture (Mali T600+), building an OpenGL driver alongside developers like Collaboran Tomeu Vizoso.

As Midgard support has grown – including initial GLES3 support – we have now turned our attention to building a Bifrost driver. We at Collabora got to work in late February, with Tomeu porting the Panfrost command stream, while I built up a new Bifrost compiler.

This week, we’ve reached our first major milestone: the first 3D renders on Bifrost, including basic texture support!

The interface to a modern GPU has two components, the fixed-function command stream and the programmable instruction set architecture. The command stream controls the hardware, dispatching shaders and containing the state required by OpenGL or Vulkan. By contrast, the instruction set encodes the shaders themselves, as with any programmable architecture. Thus the GPU driver contains two major components, generating the command stream and compiling programs respectively.

From Midgard to Bifrost, there have been few changes to the command stream. After all, both architectures feature approximately the same OpenGL and Vulkan capabilities, and the fixed-function hardware has not required much driver-visible optimization. The largest changes involve the interfaces between the shaders and the command stream, including the titular shader descriptors. Indeed, squinting command stream traces from Midgard and Bifrost look similar – but the long tail of minor updates implies a nontrivial Panfrost port.

But the Bifrost instruction set, on the other hand? A rather different story.

Let’s dive in.

## Compiler

Bifrost’s instruction set was redesigned completely from Midgard’s, requiring us to build a free software compiler targeting Bifrost from scratch. Midgard’s architecture is characterized as:

• Vector. 128-bit arithmetic logic unit (ALU) allows 32-bit 4-channel SIMD.

• Very Long Instruction Word. 5 blocks – scalar/vector add/multiply and special function unit – operate with partial parallelism across 2 pipeline stages.

Vector and VLIW architectures promise high-performance in theory, and in theory, theory and practice are the same. In practice, these architectures are extremely difficult to compile efficiently for. Automatic vectorization is difficult; if a shader uses a 3-channel 32-bit vector (vec3), most likely the extra channel will go unused, wasting resources. Likewise, scheduling for VLIW architectures is difficult and slots often go unused, again wasting resources and preventing shaders from reaching peak architectural performance.

Bifrost by contrast is:

• Scalarish. 32-bit ALU. 32-bit operations are purely scalar, 16-bit/2-channel and 8-bit/4-channel SIMD. 64-bit operation requires a special half-performance mode.

• Slightly Long Instruction Word. Bifrost has 2 blocks – fused multiply-add and add – pipelined without parallelism. Simplified special function unit.

This redesign promises better performance - and a redesign of Panfrost’s compiler, too.

## Bifrost Intermediate Representation

At the heart of a modern compiler is one or more Intermediate Representations (IRs). In Mesa, OpenGL shaders are parsed into GLSL IR – a tree IR expressing language-visible types – which is converted to NIR – a flat static single assignment IR enabling powerful optimizations. Backends convert NIR to a custom backend IR, whose design seals the fate of a compiler. A poor IR design can impede the growth and harm the performance of the entire compiler, while a well-designed IR enables complex features to be implemented naturally with simple, fast code. There is no one-size-fits-all IR; the design necessarily is built to make certain compiler passes (like algebraic optimization) simple at the expense of others (like register allocation), justifying the use of multiple IRs within a compiler. Further, IR design permeates every aspect of the compiler, so IR changes to a mature compiler are difficult. Both Intel and AMD GPU compilers have required ground-up rewrites to fix IR issues, so I was eager to get the Bifrost IR (BIR) right to begin.

An IR is simply a set of data structures representing a program. Typically backends use a “flat” IR: a list of blocks, where each block contains a list of instructions. But how do we store an instruction?

We could reuse the hardware’s instruction format as-is, with abstract variables instead of registers. It’s tempting, and for simple architectures, it can work. The initial NIR-to-BIR pass is a bit harder than with abstract instructions, but final code generation is trivial, since the packing is already done.

Unfortunately, real world instructions sets are irregular and as quirky as their compilers’ developers. Further, they are tightly packed to be small instead of simple. For final assembly, we will always need to pack the machine formats, but with this design, we also need to unpack them. Worse, the machine irregularities will permeate every aspect of the compiler since they are now embedded into the IR. On Bifrost, for example, most operations have multiple unrelated encodings; this design would force much of the compiler to be duplicated.

So what if the IR is entirely machine-independent, compiling in the abstract and converting to the machine-specific form at the very end. Such IRs are helpful; in Mesa, the machine-independent NIR facilitates sharing of powerful optimizations across drivers. Unfortunately, some design traits really do affect our compiler. Is there a compromise?

Notice the first IR simplifies packing at the expense of the rest of the compiler, whereas the second simplifies NIR-to-BIR at the expense of machine-specific passes. All designs trade complexity in one part of the compiler for another. Hence a good IR simplifies the hardest compiler passes at the expense of the easiest. For us, scheduling and register allocation are NP-complete problems requiring complex heuristics, whereas NIR-to-BIR and packing are linear translations with straightforward rules. Ideally, our IR simplifies scheduling and register allocation, pushing the complexity into NIR-to-BIR and packing. For Bifrost, this yields one golden rule:

A single BIR instruction corresponds to a single Bifrost instruction.

While single instructions may move during scheduling and be rewired from register allocation, the operation is unaffected. On the other hand, within an instruction:

BIR instructions are purely abstract without packing.

By delaying packing, we simplify manipulation. So NIR-to-BIR is complicated by the one-to-many mapping of NIR to Bifrost opcodes with special functions; meanwhile, final code generation is complicated by the pattern matching required to infer packing from abstract instructions. But by pushing this complexity to the edges, the optimizations in between are greatly simplified.

But speaking of IR mistakes, there is one other common issue…

## 16-bit Support

One oversight I made in designing the Midgard IR – an oversight shared by the IR of many other compilers in Mesa – is often assuming instructions to operate on 32-bit data only. In OpenGL with older Mesa versions, this assumption was true as the default float and integer types are 32-bit. However, the assumption is problematic for OpenCL where 8-, 16-, and 64-bit types are first class. Even for OpenGL, it is suboptimal. While the specification mandates minimum precision requirement for operation, fulfillable with 32-bit arithmetic, on shaders using mediump precision qualifiers we may use 16-bit arithmetic instead. About a month ago, Mesa landed support for optimizing mediump fragment shaders to use 16-bit arithmetic, so for Bifrost, we want to make sure we can take advantage of these optimizations.

The benefit of reduced precision is two-fold. First, shader computations need to be stored in registers, but space in the register file is scarce, so we need to conserve it. If a shader runs out of registers, it spills to main memory, which is slow, so by using 16-bit types instead of 32-bit, we can reduce spilling. Second, although Bifrost is scalar for 32-bit, it is 2-channel vector for 16-bit. As mentioned, automatic vectorization is difficult, but if shaders are vectorized to begin, the compiler can take advantage of this.

As an example, to add 32-bit vector R0-R3 with R4-R7, we need code like:

ADD.f32 R0, R0, R4
ADD.f32 R3, R3, R7

But in 16-bit mode with vectors R0-R1 and R2-R3:

ADD.v2f16 R0, R0, R2
ADD.v2f16 R1, R1, R3

Notice both register usage and instruction count are halved. How do we support this in Mesa? Mesa pipes 16-bitness through NIR into our backend, so we must ensure types are respected across our backend compiler. To do so, we include types explicitly in our backend intermediate representation, which the NIR-to-BIR pass simply passes through from NIR. Certain backend optimizations have to be careful to respect these types, whereas others work with little change provided the IR is well-designed. Scheduling is mostly unaffected. But where are there major changes?

Enter register allocation.

## Register allocation

A fundamental problem every backend compiler faces is register allocation, mapping abstract IR variables to concrete machine registers:

0 = load uniform #0
3 = mul 2, 2

$$\rightarrow$$

R0 = load uniform #0
R0 = mul R0, R0

Traditionally, register allocation is modeled by a graph, where each IR variable is represented by a node. Any variables that are simultaneously live, in the sense that both of their values will be read later in the program, are said to interfere since they require separate registers to avoid clobbering each other’s data. Interference is represented by edges. Then the problem reduces to graph colouring, finding colours (registers) for each node (variable) that don’t coincide with the colours (registers) of any nodes connected by edges (interfering variables). Initially, Panfrost used a graph colouring approach.

However, the algorithm above is difficult to adapt to irregular vector architectures. One alternative approach is to model the register file explicitly, modeling modeling interference as constraints and using a constraint solver to allocate registers. For the above program, letting $$k_i$$ denote the register allocated to variable $$i$$, there is a single constraint on the registers $$k_0 \ne k_1$$, since $$(0, 1)$$ is the only pair interfering nodes, yielding a optimal valid solution $$k_0 = 0, k_1 = 1, k_2 = 0, k_3 = 0$$, corresponding to the allocation above.

As-is, this approach is mathematically equivalent to graph colouring. However, unlike graph colouring, it extends easily to model vectorization, enabling per-component liveness tracking, so some components of a vector are allocated to a register while others are reused for another value. It also extends easily to vectors of varying type sizes, crucial for 16-bit support, whereas the corresponding generalization for graph colouring is much more complicated.

This work was originally conducted in October for the Midgard compiler, but the approach works just as well for Bifrost. Although Bifrost is conceptually “almost” scalar, there are enough corner cases where we dip into vector structures that a vector-aware register allocator is useful. In particular, 16-bit instructions involve subdivides 32-bit registers, and vectorized input/output requires contiguous registers; both are easily modeled with linear-style constraints.

## Packing

The final stage of a backend compiler is packing (final code generation), taking the scheduled, register allocated IR and assembling a final binary. Compared to Midgard, packing for Bifrost is incredibly complicated. Why?

Vectorized programs for vector architectures can be smaller than equivalent programs for scalar architectures. The above instruction sequences to add a 4-channel vector corresponds to just a single instruction on Midgard:

VADD.FADD R0.xyzw, R0.xyzw, R1.xyzw

We would like to minimize program size, since accesses to main memory are slow and increasing the size of the instruction cache is expensive. By minimizing size, smaller caches can be used with better efficiency. Unfortunately, naively scalarizing the architecture by a factor of 4 would appear to inflate program size by 4x, requiring a 4x larger cache for the same performance.

We can do better than simply duplicating instructions. First, by simplifying the vector semantics (since we know most code will be scalar or small contiguous vectors), we eliminate vector write masks and swizzles. But this may not be good enough.

Bifrost goes beyond to save instruction bits in any way possible, since small savings in instruction encoding accumulate quickly in complex shaders. For example, Bifrost separates medium-latency register file accesses from low latency “port” accesses, loading registers into ports ahead of the instruction. There are 64 registers (32-bits each), requiring 6-bits to index a register number. The structure to specify which registers should be loaded looks like:

unsigned port_0 : 5;
unsigned port_1 : 6;

We have 6 bits to encode port 1, but only 5 bits for port 0. Does that mean port 0 can only load registers R0-R31, instead of the full range?

Actually, no - if port 0 is smaller than port 1, the port numbers are as you would expect. But Bifrost has a trick: if port 0 is larger than port 1, then the hardware subtracts 63 from both ports to get the actual port number. In effect, the ordering of the ports is used as an implicit extra bit, storing 12-bits of port data in 11-bits on the wire. (What if the port numbers are equal? Then just reuse the same port!)

Similar tricks permeate the architecture, a win for code size but a loss for compiler packing complexity. The good news is that our compiler’s packing code is self-contained and unit tested independent of the rest of the compiler.

## Conclusion

Putting it all together, we have the beginnings of a Bifrost compiler, sufficient for the screenshots above. Next will be adding support for more complex instructions and scheduling to support more complex shaders.

Architecturally, Bifrost turns Midgard on its head, ideally bringing performance improvements but rather complicating the driver. While Bifrost support in Panfrost is still early, it’s progressing rapidly. The compiler code required for the screenshots above is all upstreamed, and the command stream code is working its way up. So stay tuned, and happy hacking.

Originally posted on Collabora’s blog

## March 24, 2020

### Erik Faye-Lund

#### Introducing OpenCL™ and OpenGL® on DirectX

For the last few months, we have been working on two exciting new projects at Collabora, and it’s finally time to share some information about them with the world:

We are partnering with Microsoft DirectX engineers to build OpenCL and OpenGL mapping layers, in order to bring OpenCL 1.2 and OpenGL 3.3 support to all Windows and DirectX 12 enabled devices out there!

This work builds on a lot of previous work. First and foremost, we are building this by using Mesa 3D, with the Gallium interface as the base for the OpenGL layer, and NIR as the base for the OpenCL compiler. We are also using LLVM and the SPIRV-LLVM-Translator from Khronos as the compiler front-end.

In addition, we are taking advantage of Microsoft’s experience in creating their D3D12 Translation Layer, as well as our own experience from developing Zink.

## What is Mesa 3D?

Mesa 3D is an open source implementation of several graphics technologies, including OpenCL and OpenGL. The OpenGL implementation in Mesa is robust and is used as the base for several industry-strength OpenGL drivers from multiple GPU vendors.

Among other things, Mesa consists of several API implementations (called state-trackers) as well as the Gallium low-level driver interface. The Gallium interface hides a lot of the legacy OpenGL details and translates OpenGL calls into something that looks more like modern GPU primitives.

## Why translate APIs?

Not all Windows-powered devices have consistent support for hardware-accelerated OpenCL and OpenGL. So in order to improve application compatibility, we are building a generic solution to the problem. This means that a GPU vendor only has to implement a D3D12 driver for their hardware in order to support all three APIs.

This mapping layer is also expected to serve as a starting point in porting older OpenCL and OpenGL applications over to D3D12.

In addition, we believe this is good for the wider open source community. A lot of the problems we are solving here are shared with other drivers and translation layers, and we hope that the code will be useful beyond the use cases listed above.

## Implementation

The work is largely split into three parts: an OpenCL compiler, an OpenCL runtime, and a Gallium driver that builds and executes command-buffers on the GPU using the D3D12 API.

In addition, there is a shared NIR-to-DXIL shader compiler that both components use. For those not familiar with NIR, it is Mesa’s internal representation for GPU shaders. Similarly, DXIL is Microsoft’s internal representation, which D3D12 drivers will consume and translate into hardware-specific shaders.

### OpenCL compiler

The OpenCL compiler uses LLVM and the SPIRV-LLVM-Translator to generate SPIR-V representations of OpenCL kernels. These, in turn, are passed to Mesa’s SPIR-V to NIR translator, where some optimizations and semantical translations are done. Then the NIR representation is finally passed to NIR-to-DXIL, which produces a DXIL compute shader and the needed metadata so it can be executed on the GPU by the runtime using D3D12.

Here’s a diagram of the complete process, including NIR-to-DXIL, which will be described below:

### OpenCL runtime

While Mesa provides an OpenCL implementation called Clover, we are not using it for this project. Instead, we have a new OpenCL runtime that does a more direct translation to the DirectX 12 API.

### NIR-to-DXIL

DXIL is essentially LLVM 3.7 bitcode with some extra metadata and validation. This was a technical choice that made sense for Microsoft because all the major driver vendors already used LLVM in their compiler toolchain. Using an older version of the LLVM bitcode format gives good compatibility with drivers because the LLVM bitcode format is backwards compatible.

Because we depend on a much more recent version of LLVM for the compiler front-end, we sadly cannot easily use the DirectX Shader Compiler as a compiler back-end. The DirectX Shader Compiler is effectively a fork of LLVM 3.7, and we are currently using LLVM 10.0 for the compiler front-end. Using DirectX Shader Compiler as that would require us to link two different versions of LLVM into the same binary, which would have led to problems.

We also cannot easily use LLVM itself to generate the bitcode. While the LLVM bitcode format is backwards compatible, LLVM itself is not forward compatible. This means that newer versions of LLVM cannot produce a bitcode format that is understood by older versions. This makes sense from LLVM’s point of view because it was never meant as a general interchange format.

So instead, we have decided to implement our own DXIL emitter. This is quite a bit harder than it looks because LLVM bitcode goes to great lengths to try to make the format as dense as possible. For instance, LLVM does not store its bitcode as a sequence of bytes and words, but rather as variable-width bitfields in a long sequence of bits.

There are a lot of tricky details to get right, but in the end we have a compiler that works.

### D3D12 Gallium driver

The D3D12 Gallium driver is the last piece of the puzzle. Essentially, it takes OpenGL commands and, with the help of the NIR to DXIL translator, turns them into D3D12 command-buffers, which it executes on the GPU using the D3D12 driver.

There are a lot of interesting details that makes this tricky as well, but I will save those details for later.

But to not leave you empty-handed, here’s a screenshot of the Windows version of the famous glxgears, wglgears:

## Source code

In the short term, the source code can be found here. We intend on upstreaming this work into the main Mesa repository shortly, so it is not a permanent home.

## Next steps

This is just the announcement, and a whole lot of work is left to be done. We have something that works in some cases right now, but we are just starting to scratch the surface.

First of all, we need to get up to the feature-level that we target. Our goals at the moment is to pass conformance tests for OpenCL 1.2 and OpenGL 3.3. We have a long way to go, but with some hard work and sweat, I am sure we will get there.

Secondly, we need to work on application compatibility. For now we will be focusing on productivity applications.

We also want to upstream this in Mesa. This way we can keep up with fixes and new features in Mesa, and other drivers can benefit from what we are doing as well.

## Acknowledgments

It is also important to point out that I am not the only one working on this. Our team consists of five additional Collabora engineers (Boris Brezillon, Daniel Stone, Elie Tournier, Gert Wollny, Louis-Francis Ratté-Boulianne) and two Microsoft DirectX engineers (Bill Kristiansen, Jesse Natalie).

## March 09, 2020

### Alexandros Frantzis

#### Interactive time series graph of confirmed COVID-19 cases per country

I have been looking for more information about the trends of the number of confirmed COVID-19 cases per country. I could find the numbers for each day, but I wanted a visual representation of the information in order to readily compare the trends in various regions.

John Hopkins University (JHU) has set up a great dashboard tracking the COVID-19 cases worldwide. It contains a time series graph of the overall cases, but doesn't provide such graphs for each country. Thankfully, JHU have made all their data publicly available (for educational/academic purposes) in their JHU CSSE COVID-19 repository.

Since I couldn't find a visualization that matched my needs I decided to create my own interactive graph:

Interactive time series graph of confirmed COVID-19 cases per country

The graph is implemented as a web page that uses client-side processing (i.e., javascript) to plot the time series for one or more countries. The JHU data is fetched and processed by the client on each page load, so the graph will display the latest data without requiring any server-side updates.

For example, this is a plot from that page for France, Italy and Spain as of 2020-03-08:

Given the time offset between the trend lines, it's not always clear how the trends compare. To be able to compare trends more easily, I added an option to synchronize the displayed time series so that they start at (approximately) the same confirmed case count. Here is the same series as above but synchronized at 200 cases:

This plot indicates that France and Italy are following roughly the same trend, whereas Spain is faring a bit better.

Before you draw any conclusions about trends that show up after synchronization, keep in mind two important caveats.

First, since our sampling granularity is one day, and jumps in confirmed cases between successive days can be large, starting points for each series after synchronization may in fact be different enough to skew results. For example, we may want to synchronize at around 200, but if one series jumps from 50 to 300, the starting point for that series is not going to be near 200.

Second, using different synchronization points may provide different views of the trends. The same France-Italy-Spain plot synchronized at 40 cases gives:

In this plot we can see that France and Spain initially followed a similar trend, but then France started diverging upwards. This is somewhat apparent in the original, non-synchronized plot, but only because the France and Spain trend lines have a small time offset.

I hope people find this interactive graph informative!

## February 21, 2020

#### Follow-up on the train journey to FOSDEM

Here’s a recap of my train journey based on the Twitter thread I kept posting as I travelled.

## To FOSDEM…

The departure from Bratislava was as planned:

Half an hour in Vienna was just enough for me to grab some coffee and breakfast and board the train to Frankfurt without a hurry:

Unfortunately, soon after we left Linz and headed to Passau, the train broke down. Apparently, it powered down and the driver was struggling to reboot it. After more than an hour at Haiding, we finally departed with a huge delay:

Since the 18:29 train to Brussels I needed to catch in Frankfurt was the last one that day, I was put into a hotel Leonardo across the street from Frankfurt Hbf, paid by Deutsche Bahn, of course. By the time of our arrival in Frankfurt, the delay was 88 minutes.

Luckily, I didn’t have to convince Deutsche Bahn to let me sleep in the morning, they happily booked me (for free) onto a 10:29 ICE to Brussels so I had an opportunity to have a proper breakfast at the hotel and spend some time at Coffee Fellows at the station.

Fun fact: Aachen is called Cáchy in Czech, apparently as a corruption of an older German form ze Aachen.

Having met some Debian people on the train, I have finally arrived in Brussels, albeit with some delay. This, unfortunately meant that I haven’t gone to Vilvoorde to see a friend, so the regional tickets I bought online were useless.

## … and back!

The trip home was much better in terms of missed trains, only if a tiny bit more tiring since I took it in one day.

Going to Frankfurt, I’ve spent most of the time in the bistro carriage. Unfortunately, the espresso machine was broken and they didn’t have any croissants, but the tea with milk was good enough.

I’ve used the fifty minutes I had in Frankfurt to claim the compensation for the delay, which (€33) I received in my bank account the next week.

Finally, exactly twelve hours and one minute after the departure, almost home:

## January 31, 2020

#### Using gcc sanitisers to get a nasty bug fixed

A couple of days ago a colleague at Collabora asked me to help create a Debian package for the tool he needed to complete his task. The tool happened to be an NXP code signing tool, used to sign OS images to be run on i.MX systems using ‘High Assurance Boot’.

As it often happens, the tool was distributed in a manner typical for many big corporations: no direct link to the tarball, custom buildsystem, compiled binaries in the same tarball as the sources. A big relief was that the tool has been distributed under a three-clause BSD license since version 3.3.0 (the sources were not provided at all before that).

The custom buildsystem meant it was not using the Debian build flags (CFLAGS := …, CC := gcc), so I had to plug a custom ‘toolchain’ Makefile into the right place with definitions relevant to Debian:

DPKG_EXPORT_BUILDTOOLS = 1

include /usr/share/dpkg/buildtools.mk

LD = $(CC) COPTIONS += -std=c99 -D_POSIX_C_SOURCE=200809L -fPIC$(shell dpkg-buildflags --get CPPFLAGS) $(shell dpkg-buildflags --get CFLAGS) LDOPTIONS +=$(shell dpkg-buildflags --get LDFLAGS)


After some time of playing with the build system, the package was done and sent to a colleague pending an upload to Debian. What did worry me a bit was a weird warning the compiler gave me every time:

../../code/front_end/src/acst.c: In function ‘encrypt_images’:
../../code/front_end/src/acst.c:791:25: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
791 |         uint8_t *hash = malloc(hash_size);
|                         ^~~~~~~~~~~~~~~~~


We didn’t use encryption, so I didn’t feel motivated to investigate.

Next day, the colleague comes back to me with this:

I did test builds in Apertis SDK, Debian buster, Fedora 31, ALT Linux 9 -- only the latter environment produced the cst tool which generated correct signature for the file.

The rest produce a bit different signature with zeroes for the current kernel:

000007E9 00 30
000007EA 00 82
000007EB 00 01
000007EC 00 F9
00000D41 00 30
00000D42 00 82
00000D43 00 01
00000D44 00 F9


I found that the problem is related to compilation of code/front_end/src/cst.c.

The colleague sent me a tarball with the working binary and a simple test case.

### First attempt

Okay, let’s see. What are those eight different bytes? The signing tool comes with a utility, csf_parser to look inside of the signed file and extract bits and pieces. With debug logs on, for the correctly signed file the logs say this:

CSF TAG: 0xD8
Parsing following values
0xD8 0x02 0x01 0x40 0x30 0x82 0x01 0xF9 0x06 …


The incorrectly signed one has this:

CSF TAG: 0xD8
Parsing following values
0xD8 0x02 0x01 0x40 0x00 0x00 0x00 0x00 0x06 …


What’s 0xD8? hab_types.h says:

#define HAB_TAG_SIG  0xd8         /**< Signature */


Apparently, the first four bytes are a header of the data chunk, so the differing bytes are part of the signature itself.

Let’s add some debug prints to see what’s going on. Oh no: now the difference is not just 8 bytes, but much more. Looks like there’s some memory corruption or undefined behaviour or maybe both.

Meanwhile, the colleague chimes in: he tried linking cst.o from ALT Linux with the rest of the program built on Debian, and produced a working binary.

My first thought was: oh, it’s just some code relying on undefined behaviour, what if I decompiled both object files and compared them?

In the past, when I needed to disassemble something I used IDA, but since it’s proprietary, I never had a license for it, and I don’t have it anymore anyway, it was not an option. It did, however, have a very good decompilation module which produced mostly readable C-like pseudocode.

When a couple of years ago I looked for a free software replacement for IDA, radare seemed like it had some potential, but it didn’t seem ready, and it wasn’t fully packaged for Debian. The things have changed, and not only radare is in Debian, but it also comes with a GUI, Cutter, which is very good, even though there’s always toom for improvement.

To my surprise, it turned out radare also has some decompilation functionality which worked really well.

Anyway, I won’t go further into it since this experiment didn’t give me any useful data to work with.

### Trying it with gdb

Good, so the code at fault is somewhere close to the signing functions, so let’s try and debug it.

I rarely use the command-line gdb for anything else than printing backtraces, so here’s a small cheatsheet for the next time I need it:

• break -func create_sig_file: set breakpoint at a function
• finish: ‘step out’, run to the end of the current function and stop right after it returns
• x/10xb pointer: print 10 hexadecimal bytes

This is what I have figured during my debugging session: the data gets corrupted at the end of the signing sequence, inside or before the cms_to_buf function invocation, which gets the signature data copied from the internal OpenSSL data structure into the output buffer:

(gdb) finish
Run till exit from #0  cms_to_buf (cms=cms@entry=0x555555598a60, bio_in=bio_in@entry=0x555555597ff0,
data_buffer=data_buffer@entry=0x7fffffffd380 "\003", data_buffer_size=data_buffer_size@entry=0x7fffffffd37c,
0x0000555555564399 in gen_sig_data_cms (sig_buf_bytes=0x7fffffffd37c, sig_buf=0x7fffffffd380 "",
hash_alg=<optimised out>, key_file=0x555555578f70 "IMG1_1_sha256_2048_65537_v3_usr_key.pem",
cert_file=0x555555591440 "IMG1_1_sha256_2048_65537_v3_usr_crt.pem", in_file=<optimised out>)
Value returned is $2 = 0 (gdb) x/10xb 0x7fffffffd380 0x7fffffffd380: 0x00 0x00 0x00 0x00 0x06 0x09 0x2a 0x86 0x7fffffffd388: 0x48 0x86 (gdb) finish Run till exit from #0 cms_to_buf (cms=0x445290, bio_in=0x444800, data_buffer=0x7fffffffcbe0 "P\320\377\377\377\177", data_buffer_size=0x7fffffffcbcc, flags=17090) at ../../code/back_end/src/adapt_layer_openssl.c:365 0x00000000004109f6 in gen_sig_data_cms (in_file=0x416553 "imgsig.bin", cert_file=0x43dc50 "IMG1_1_sha256_2048_65537_v3_usr_crt.pem", key_file=0x43d640 "IMG1_1_sha256_2048_65537_v3_usr_key.pem", hash_alg=SHA_256, sig_buf=0x7fffffffcbe0 "0\202\001\371\006\t*"..., sig_buf_bytes=0x7fffffffcbcc) at ../../code/back_end/src/adapt_layer_openssl.c:487 487 in ../../code/back_end/src/adapt_layer_openssl.c Value returned is$5 = 0
(gdb) x/10xb 0x7fffffffcbe0
0x7fffffffcbe0:    0x30    0x82    0x01    0xf9    0x06    0x09    0x2a    0x86
0x7fffffffcbe8:    0x48    0x86


### Trying it with splint and valgrind

Having failed to find the data corruption by manually inspecting the source, I tried enabling every compiler warning I could, which gave me no results, and then decided to use external tools to help me with that.

First comes splint, a C linter. It produced about a thousand warnings and errors which, while probably were at least partially valid, completely overwhelmed me.

Next came valgrind, which, even at the most aggressive settings only pointed me at uninitialised padding data in some structures:

number:  NUMBER
{
$$= malloc(sizeof(number_t));$$->next = NULL;
->num_value = $1; } ;  And, of course, OpenSSL was reading uninitialised data to improve the randomness. Not very helpful. ### Almost giving up Since I have spent most of the day debugging, valgrinding, reading code and worrying about my soon to be missed connection to Brussels in Frankfurt, I became very frustrated about having no useful results and was close to giving up and calling it a day. As a last chance attempt, I decided to try all hardening options I could find in the GCC manual page and on the Debian wiki. First, DEB_BUILD_MAINT_OPTIONS = hardening=+all, -fPIE -pie. Not much changed. Or rather nothing at all. Then, I found sanitisers. Okay, dump it all into the makefile: -fsanitize=address -fsanitize=undefined -fsanitize=leak -fno-omit-frame-pointer. Compile, link, run: nothing works. Apparently, they need link options as well: -static-libasan -static-libubsan -static-liblsan Cracked it! /** * sig_size as input to gen_sig_data shows the size of buffer for * signature data and gen_sig_data returns actual size of signature * data in this argument */ uint32_t sig_size = SIGNATURE_BUFFER_SIZE; ... /** * Calling gen_sig_data to generate signature for data in file using * certificate in cert_file. The signature data will be returned in * sig and size of signature data in sig_size */ ret_val = gen_sig_data(file, cert_file, hash, sig_fmt, sig, (size_t *)&sig_size, g_mode);  Of course, size_t is 32-bit on 32-bit systems, but significantly wider on 64-bit ones, so no surprise this code would overwrite something else on the stack, in this case the signature data. One of the lessons I learnt from this way: never underestimate the compiler’s built-in tools. At least, not the sanitisers! Oh, just in case you want to see more: the patches currently live at the Apertis GitLab. 2020-02-18 Update: This post has also been published at Collabora’s website: https://www.collabora.com/news-and-blog/blog/2020/02/18/using-gcc-sanitisers-to-get-a-nasty-bug-fixed ## January 30, 2020 ### Sebastian Reichel #### Collabora & Linux Kernel 5.5 Linus tagged the 5.5 kernel release, so let’s have a look at Collabora’s participation. Adrian, Martyn and Gustavo Padovan worked on the Broadcom brcmfmac WiFi driver fixing suspend support for some platforms which cut power during sleep to save battery and removing an incorrect warning message. These were done as part of a broader effort to improve upstream support of peripherals used together with the i.MX 6 family of processors. ## January 28, 2020 ### Andrew Shadura #### FOSDEM by train I’ve always loved train journeys, but with flygskam changing people’s travel preferences across Europe (and possibly worldwide, though probably not that much), I decided to take train to FOSDEM this time. When I first went to FOSDEM which, just in case you don’t know, happens each February in Brussels at ULB, I flew with Ryanair from Bratislava to Charleroi because it was cheaper. After repeating the same journey a couple of times and I once nearly missed the last bus coach to Brussels because of a late flight, and decided to rather pay more but travel with more comfort to Brussels Zaventem, the main airport of Brussels. It’s well-connected with Brussels, trains run fast and run often, which is a significant upgrade in comparison to Charleroi, where the options were limited to bus coaches and a slow train connection from Charleroi the town. As some of my readers may know, my backpack was stolen from me after FOSDEM two years ago, and with it were gone, among other things, my passport and my residence permit card. With my flight home having been planned two and half hours from the moment when I realised my things are gone, I couldn’t get a replacement travel document quickly enough from the embassy, so I had to stay at my friends in Vilvoorde (thanks a lot again, Jurgen!) and travel with the cheapest ground transportation I could find. In my case, it was a night RegioJet coach to Prague with a connection to (again) RegioJet train to Bratislava. (I couldn’t fly even though I already had my temporary travel document since I might need to somehow prove that I’m allowed to be in the Schengen zone, which is difficult to do without a valid residence permit.) Sleeping on a bus isn’t the best way to travel for long distances, and I was knackered when I finally dropped on my sofa in Bratislava next morning. However, what I learnt was that it was possible, and were it a bit more comfortable, I wouldn’t mind something like this again. Here I must admit that I’ve travelled by long-distance trains quite a fair bit: in my childhood we went by train to summer holidays to Eupatoria in Crimea and Obzor in Bulgaria (through Varna). Both journeys took days, and the latter also involved a long process of changing the bogies on the Moldovan-Romanian border (Giurgiulești/Galați). Since I moved to Slovakia, I many times took the night train from Minsk to Warsaw with a connection to Bratislava or Žilina, a journey which usually takes at least 18 hours. Which is to say, I’m not exactly new to this mode of travel. With the Austrian railway company ÖBB expaning their night train services as a part of their Nightjet brand, the Vienna to Brussels sleeper returned to service last week. Prices are still a bit higher than I would have preferred (at the time of me writing this, ticket in a compartment with 6 couchettes start at €79, but it’s not as bad as it could be (apparently the last minute price is more than €200). Anyway, when I decided to go to Brussels by train, this service didn’t exist yet, so instead I followed the very useful tips from the Man in the Seat 61 and booked a day-time connection: Bratislava to Vienna to Frankfurt to Brussels. Date Station Arrival Departure Train 30.1 Bratislava hl.st. 9:38 REX 2511 Wien Hbf 10:44 11:15 ICE 26 Frankfurt am Main Hbf 17:36 18:29 ICE 10 Bruxelles-Nord / Brussel Noord 21:26 21:37 IC 3321 Vilvoorde 21:45 Date Station Arrival Departure Train 4.2 Bruxelles-Midi 8:23 ICE 13 Frankfurt am Main Hbf 11:31 12:22 ICE 27 Wien Hbf 18:45 19:16 R 2530 Bratislava hl.st. 20:22 I’ve booked two through tickets since in this case the Super Sparschiene discount offered lower prices than normally an international return ticket would offer. For some reason neither ZSSK (Slovak railways) nor ÖBB offered ticket online (or for a comparable price in a ticket office anyway), so I booked online with Deutsche Bahn for €59.90 each way. This sort of ticket, while bookable online, had to be posted for a €5.90 extra. Since I’m staying the first night at friend’s in Vilvoorde again, I also had buy a ticket for this small stretch of the track from the Belgian railways directly. See you at FOSDEM! ## December 16, 2019 ### Alexandros Frantzis #### Efficient saving of multi-source buffers ## Preface Several years ago I was working on libbls, a library implementing editable buffers that can efficiently hold contents originating from both memory and file sources. Typical clients of this libraries are programs that need to edit arbitrary portions of very large files, like hex editors or wave editors, while minimizing the required memory usage. One of the more interesting problems I encountered was devising an efficient way to save buffers to their corresponding files in place, while minimizing the extra memory (or disk space) used in the process. This post describes how this problem, which turns out to be NP-hard, can be modeled as a graph, and how we can provide a reasonable solution by using a variation of a standard graph algorithm. ## Introduction Buffers in libbls are described in terms of segments that contain data from either memory or file sources. In order to conserve memory (and sometimes to even make the editing feasible) the data from file sources is not loaded into main memory. Instead, the buffer keeps information about the contained files and ranges and reads the data in vm-page-sized chunks when needed. So, a buffer could look like: B: |F1:0-9 |M1:10-18|F1:10-19 |M2:10-19|F2:0-6|  Where F1 and F2 are file sources, M1 and M2 are memory sources and S:B-E denotes a segment representing bytes from the range [B, E] of source S. In the simple case, saving a buffer to a file consists of just reading data from the various sources and writing them out to the target file. Things become more complicated when the target file happens to be one of the file sources used in the buffer. This post aims to illustrate the issues that can arise in such cases and propose an elegant way to resolve them. ## Illustrating the problem Saving the buffer to a file typically involves reading each segment in order and writing it out to the file. However, if the segment source is the same as the file that we are writing to, and depending on the ordering of segments, we may face a serious problem. To illustrate the problem, consider a buffer B containing segment [0-9] of file F at range [10-19] of the buffer, and assume that we want to write the buffer back to file F:  S F: |0-9 |.........| S B: |.........|F:0-9 | ^ ^ ^ 0 10 20  Writing the buffer segments in order leads to a situation where we first write out buffer range [0-9] to file F, overwriting the contents of the file at that range. So, when moving on to the next segment, trying to write out the range [10-19] of the buffer, the data we are going to read from the file (and therefore write) is not the original data of [F:0-9], but the data belonging to the previous segment of the buffer we just wrote at positions [0-9]. The solution in this case is to first write out range [10-19] of the buffer to the file, so that the data we use is the original file data, and only then write out any other segments. The problem can become even more complicated when we have more segments from the target file. A particularly interesting case arises when a segment from the buffer which belongs to the target file happens to overlap with another segment of F which is also present in the buffer:  S T F: |0-9 |.........|20-29 |.........| T S B: |.......|F:20-29 |...........|F:0-9 | ^ ^ ^ ^ 0 8 18 30  In this case segment S in F overlaps with segment T in B. We can simply try to adapt the strategy used in the previous case and first write out the two target file segments. This works but only if we are extra careful. In this case, if we first write segment T then when we try to read the data of segment S we will read wrong data (file range 8-9 will contain data from segment T). If we do it the other way around everything works wonderfully. Taking the last example one step further, consider what happens if we have cyclic overlaps:  S T F: |0-9 |.........|20-29 |....| T S B: |.......|F:20-29 |......|F:0-9 | ^ ^ ^ ^ ^ 0 8 18 25 30  Now segment S in F overlaps with segment T in B, and segment T in F overlaps with segment S in B. In this case there is no way to order the writes of segments S and T so that we end up with the correct data. A straightforward but wasteful way of tackling this problem is to save the whole buffer to a temporary file and then move the file to its final location. This works, and in some cases is preferred since it has the benefit of atomicity, but requires extra space for the temporary file. When buffers reach many GiB in size this method may prove to be unfeasible. A more efficient method is to try to eliminate the cyclic overlaps by removing at least one of the overlaps involved. In the previous case we can replace segment T in B with two segments M and T1 so that no overlap occurs. Segment M will contain a part of T's data but will actually store them in memory (or a temporary file if necessary) and the T1 segment will just be a T minus the range stored in M. This approach still requires some amount of extra memory, but in most cases this amount is much smaller than the size of the whole buffer. Using this scheme we have:  S T1 F: |0-9 |...........|22-29 |....| M T1 S B: |.......|.|F:22-29|......|F:0-9 | ^ ^ ^ ^ ^ 0 8 18 25 30  We have managed to eliminate the first overlap and now we can safely save the buffer by first writing T1 then S and then the remaining segments of B. ## Modeling the problem The examples presented in the previous section show representative but simple cases of the problems we can run into. In order to be able to handle cases of arbitrary complexity it is useful to create a model of the problem. We can then use this model to come up with algorithms that provide generic solutions. The model we are going to present uses graphs. In these overlap graphs vertices represent segments from the file we are going to save to, that are also present in the buffer we are going to save. Edges between vertices denote segment overlap: an edge from vertex P to vertex Q denotes that segment P in the target file overlaps with segment Q in the buffer. Every edge carries with it the size of the overlapping range. Here are some buffers and their respective overlap graphs (P* denotes a self-loop on P):  P Q R F: |0-9 |....|15-24 |.........|35-42 | P B1: |........................|F:0-9 |.......| P P Q B2: |.........|F:0-9 |...|F:15-24 |.......| Q --> P P Q B3: |....|F:0-9 |......|F:20-29 |..........| P* Q* R P Q B4: |F:35-42|F:0-9 |...........|F:20-29 |..| R <-- P* <-- Q \___________^  ## Solving the problem Using the overlap graph we can now answer the question: in what order should the vertices (segments) be written to the file so that the correct data is written? If the graph has no cycles then we can process the vertices in topological order. Topological order guarantees that each vertex is processed only when its dependencies have already been processed and written to file, and thus that no unwritten file segment in the buffer overlaps with the destination range of that vertex. As a special case, we can process a vertex that has a self-loop but no other incoming edge. This is achieved by first writing to the file the non-overlapping part of the vertex and then the rest. In order to handle graphs with cycles (except self-loops) we must find a way to break the cycles. This can be achieved by removing the correct edges. An edge from P to Q can be removed by replacing Q in the buffer by the segments Q1 and M as described previously. M contains an in memory copy of data of the overlapping range. Q1 is the part of Q that remains. This technique eliminates the overlap (because the overlapping part is now stored in memory) and removes the edge. By choosing the edges to remove wisely, we can minimize the amount of extra memory we are going to need. Once we have an acyclic graph (self-loops allowed) we can process it as described previously. Let's manually apply this method to an example:  P Q R F: |0-9 |....|15-24 |.........|35-42 | R P Q B4: |F:35-42|F:0-9 |...........|F:20-29 |..| R <-- P* <-- Q \___________^  We first have to make the graph acyclic. We have a few choices for which edge to break: • Q->P, a 3 byte overlap • P->R, a 7 byte overlap • R->Q, a 5 byte overlap To minimize the required extra memory we break the Q->P edge by storing the last three bytes of P in memory as segment M, so we get:  P1 Q R F: |0-6 |.......|15-24 |.........|35-42 | R P1 M Q B4: |F:35-42|F:0-6 |..|...........|F:20-29 |..| Q <-- R <-- P1  Note that coincidentally we also resolved the self-loop of P. Now that we have an acyclic graph we can process the buffer segments in topological order, thus we first store P1 in range [8-14] of the file, next is R at [0-7] and finally Q at range [30-40] followed by all other portions of the buffer. ## Implementation Details libbls implements the solution described above when saving a file in place. The high level logic for the algorithm resides in bless_buffer_save. The implementation of the algorithm within that function can be conceptually split into three steps. Some implementation details for each step are given below, along with links to the functions that provide the relevant functionality. #### Step 1: Creating the overlap graph Relevant functions: buffer_file.c: create_overlap_graph overlap_graph.c: overlap_graph_add_segment To create the overlap graph we have to scan the buffer segment collection for segments that belong to the file we want to save to. Each new segment that is found must be checked against all previous found file segments for overlap. This is an O(n + k*f(k)) algorithm where n is the number of segments in the segment collection and k the number of segments belonging to the file. If we check for overlap naively then f(k) = O(k) and therefore the algorithm is O(n + k*k). We can use an interval tree so that f(k) = O(logk) and the algorithm becomes O(n + k*logk). As we don't expect k to grow too large, the implementation currently uses the naive approach for simplicity. #### Step 2: Making the graph acyclic Relevant functions: overlap_graph.c: overlap_graph_remove_cycles In the second step the goal is to remove cycles from the graph, while trying to minimize the total weight of the removed edges in order to minimize extra memory usage. However, it turns out that the problem we are trying to solve, minimum feedback arc set, is NP-Hard! Don't despair though; we can get reasonable results with only little extra work. The undirected counterpart of the minimum feedback arc set problem is the minimum (or maximum) spanning tree (MST) problem, which can be efficiently solved by using various methods. We can use these methods to give a solution (but not the best) for our problem, too. We can do that because MST algorithms remove all undirected cycles and therefore all directed ones, too. The caveat is that they remove more that we need them to; undirected cycles that are not directed. MST algorithms work by processing edges in non-decreasing or non-increasing order of weight and adding them to the already formed tree (Prim) or trees (Kruskal) if they aren't already connected to them by an undirected path. Because our goal is to avoid directed cycles we can be more lax with the rules of adding an edge. A simple but useful observation is that if the destination node of an edge has no outgoing edges then a directed cycle cannot be formed, regardless of whether there is a pre-existing undirected path between the two nodes. Similarly, if the source node of an edge has no incoming edges a directed cycle cannot be formed either, regardless of the pre-existence of an undirected path between the source and destination nodes. We can use the previous observations to decrease the number of edges that are incorrectly deleted by the MST algorithms when acting on directed graphs. Although we don't completely eliminate the erroneous deletions, these rules give reasonable results while keeping the complexity the same as in the undirected case. Of the MST algorithms, Kruskal's algorithm can be used unchanged in directed graphs because it works directly on edges and doesn't care about their direction. It also works on disconnected graphs, since it finds a spanning forest. For these reasons, it was selected as the basis for the algorithm used by libbls. #### Step 3: Removing edges and saving to file Relevant functions: overlap_graph.c: overlap_graph_get_removed_edges buffer_file.c: break_edge buffer_util.c: segcol_store_in_memory, segcol_store_in_file overlap_graph.c: overlap_graph_get_vertices_topo In the previous step we calculated which edges would need to be removed to create an acyclic graph. In this step we first perform the actual removal of these edges by storing the associated data in memory or a temporary file as described previously. Removing the edges has the unfortunate side effect of further altering the graph as segments are removed, added or changed in the segment collection. This means that after removing the edges we must re-calculate the overlap graph for the segment collection. There may be some way to avoid re-calculating the whole graph, making only local changes to the existing graph as vertices are altered but we will use the simple (although more costly) way for now. We then use this new overlap graph, which is guaranteed not to have any cycles, to get the vertices in topological order and save them to the target file. ## October 28, 2019 ### Guillaume Tucker #### KernelCI and the Linux Foundation A KernelCI developer's view on the project joining the Linux Foundation As announced today at the Open Source Summit in Lyon (see the official press release), the KernelCI project which powers kernelci.org with automated testing for the upstream Linux kernel has now joined the Linux Foundation. This is the new home the project has finally found after sailing through uncharted waters for over five years. ## What does it actually mean for the project? Given the huge scale at which the Linux kernel is being used, achieving comprehensive test coverage for it is an incredibly challenging goal. Based on the open source philosophy principles, KernelCI's distributed architecture makes this possible by enabling the whole kernel community to collaborate around a single upstream CI system. Becoming part of the Linux Foundation will let the project flourish and become in turn integral part of the Linux kernel development workflow. Some actual results of this move can already be seen with the new common database for storing test results from KernelCI and other related projects that share a common goal such as Red Hat's CKI and Linaro's LKFT. It is an experimental step towards expanding KernelCI in a modular way to interact with other existing systems. There are as many different ways to test the kernel as there are use-cases for it, and many types of specialised systems to cover: a CPU architecture such as 0-day, a Linux distribution such as CKI, a range of reference platforms such as LKFT... ## What happens next? This is only a new beginning, many things are yet to come with many decisions to be made: how to use the budget available from the membership scheme, how to make the infrastructure more sustainable and scalable, how to compromise and address the needs of all the members joining the project... Answers to all these questions are likely to appear as the coming months and years unfold. One thing we can be sure of is that there is no reason for the current development plans to stop or be impacted in any way - on the contrary. There is indeed a strong need to extend test coverage and the capabilities of KernelCI at large, with a huge potential to improve upstream kernel development as a direct result. Becoming part of the Linux Foundation should be about facilitating progress in this direction above all. ## What does Collabora plan to do? Collabora has been involved with KernelCI since almost the beginning, providing some of the servers running the kernelci.org services and a large LAVA test lab. We've also become the biggest contributor to KernelCI development with myself as de facto maintainer for the core repositories on Github and in charge of the weekly kernelci.org updates. Among all the things we have done in the last couple of years, these are probably the most significant ones: • Enabled automated bisection for boot regressions • Added v4l2-compliance test suite and a few more minor ones (suspend/resume...) • Enabled "Depthcharge" bootloader in LAVA to run KernelCI tests on Chromebooks • Portable command-line tools and YAML-based configuration to perform all the KernelCI steps in any environment Going forward, here are some clear items on the project's current roadmap which will carry on full steam ahead with a growing team of developers: • Further improve handling of test results: web interface, email reports, regression tracking • Extend coverage of test suites: kselftest, LTP, KUnit, i-gpu-tools and many more • Deploy new bisection tool to cover complex test suite results • Improve modular core architecture to enable inter-operability between the various elements that form the overall KernelCI pipeline Then like with every open source project, other contributors will also add more features, enable new platforms in the global test hardware pool or bring new ideas. The combined efforts of all parties is what makes the project unique and mature enought to reach its ultimate goal. We will make sure we continue to be among the top active players within the growing KernelCI community, as a founding member of the Linux Foundation project as well as with myself being both a maintainer and on the technical steering committee. ## October 24, 2019 ### Erik Faye-Lund #### Zink: Fall Update I recently went to XDC 2019, where I gave yet another talk about Zink. I kinda forgot to write a blog-post about it, so here’s me trying to make up for it… or something like that. I’ll also go into some more recent developments as well. My presentation was somewhat similar to the talk I did at SIGGRAPH this year, but with a bit more emphasis on the technical aspect, as the XDC audience is more familiar with Mesa internals. If you’re interested, you can find the slides for the talk here. The talk goes through the motivation and basic approach. I don’t think I need to go through this again, as I’ve already covered that before. As for the status, Zink currently supports OpenGL 2.1 and OpenGL ES 2.0. And there’s no immediate plans on working on OpenGL 3.0 and OpenGL ES 3.0 until Zink is upstream. Which gets us to the more interesting bit; that I started working on upstreaming Zink. So let’s talk about that for a bit. ## Upstreaming Zink So, my current goal is to get Zink upstream in Mesa. The plan outline in my XDC talk is slightly outdated by now, so here I’ll instead say what’s actually happened so far, and what I hope will follow. Before I could add the driver itself, I decided to send all changes outside of the driver as a set of merge-requests: The last one is probably the most interesting one, as it moves a lot of fixed-function operations into the state-tracker, so individual drivers won’t have to deal with them. Unless they choose to, of course. But all of these has already been merged, and there’s just one final merge-request left: This merge-request adds the driver in its current state. It consists of 163 commits at the time of writing, so it’s not a thing of beauty. But new drivers usually aren’t, so I’m not too worried. When this is merged, Zink will finally be a “normal” part of Mesa… Well, sort of anyway. I don’t think we’ll enable Zink to be built by default for a while. But that’ll just be a matter of adding zink to the -Dgallium-drivers meson-option. ### Testing on CI The current branch only adds building of Zink to the CI. There’s no testing being done yet. The reasons for this is two-fold: 1. We need to get a running environment on CI. Rather of bringing up some hardware-enabled test-runner, I intend to try to set up SwiftShader as a software rasterizer instead, as that supports Vulkan 1.1 these days. 2. We need some tests to run. Zink currently only supports OpenGL 2.1, and sadly the conformance test suite doesn’t have any tests for OpenGL versions older than 3.0. Piglit has some, but a full piglit-run takes significantly more time, which makes it tricky for CI. So right now, it seems the OpenGL ES 2.0 conformance tests are our best bet. We’ll of course add more test-suites as we add more features. So, there’s some work to be done here, but it seems like we should be able to get something working without too much hassle. ## Next steps After Zink is upstream, it will be maintained similarly to other Mesa drivers. Practically speaking, this means that patches are sent to the upstream repo rather than my fork. It shouldn’t make a huge difference for most users. The good thing is that if I go away on vacation, or are for some other reason unavailable, other people can still merge patches, so we’d slightly reduce the technical bus-factor. I’m not stopping developing Zink at all, but I have other things going on in my life that means I might be busy with other things at times. As is the case for everyone! :wink: In fact, I’m very excited to start working on OpenGL 3.x and 4.x level features; we still have a few patches for some 3.x features in some older branches. The future is bright! :sunny: ## October 07, 2019 ### Alexandros Frantzis #### mda-rs: Custom Mail Delivery Agents in Rust I had been stubbornly using procmail for two decades to filter my email, before deciding that it's finally time for a change. Procmail served me well through the years, but, little by little, the annoyances started adding up. The fact that procmail is unmaintained since 2001 inspired less and less confidence as years passed, and the requirement in many cases for external processing before any useful matching could be performed (e.g., dealing with MIME content transfer encoded data, or non us-ascii character sets) was a constant pain point. As the complexity of my rules grew, I even had to resort to an external program (a custom C++ program in my case) to perform some of the mailbox selection logic, using the AUTOFOLDER feature of procmail. At that point, and given all the functionality that I had to implement myself, I seriously started questioning the value procmail was providing to me and started looking for alternatives. I evaluated fdm and maildrop, finding in both a lot that I liked; first and foremost a less arcane filtering language. In the end, I found maildrop to be a closer match to my preferences and requirements, and I especially appreciated the automatic MIME content decoding. I briefly considered switching to maildrop, but a few experiments on my set of filtering rules indicated that maildrop's performance was significantly worse compared to procmail, even though for procmail I had to call out to a few more external programs to achieve similar functionality. I also gave fdm a go, and it was even slower than maildrop. I receive a lot of emails each day, mostly from various FOSS mailing lists, and the performance penalty adds up. While waiting for a few dozen seconds daily wouldn't have been the end of the world, the thought that my original and much older solution was faster, disturbed me. Meanwhile, I noticed that the control flow statements in maildrop's filtering language were similar to that of a general purpose programming language, and, in fact, with the integration with regular expressions, they looked a bit like perl. So, I thought, what if instead of using a domain specific language, I could use a general purpose language, augmented by some kind of library to implement the same filtering and delivery functionality. My thoughts then drifted to other things, but the seed of that idea had already been planted in my mind it seems, as a few years later I found myself implementing the code that would become mda-rs. With mda-rs I wanted to create an experience as close as possible to using an interpreted domain specific language, the approach follow by typical MDAs, while still having the performance and power of a full, compiled language at the fingertips. One aspect of this experience was providing an API that feels like natural fit for the intended purpose. The other aspect was providing a straightforward way to build a custom MDA. For this second aspect, the simplicity of Rust's cargo was one of the reasons I decided to use Rust for this project. Another decision catering to a similar ease-of-use goal was that the user shouldn't be required to use external programs to transform the data just so they could perform basic matching. To this end, mda-rs, like maildrop, normalizes the email before processing, by decoding and converting all text MIME content parts (including MIME encoded-words in headers) to plain UTF-8. Of course, I also wanted the resulting custom MDAs to be fast; performance was my main disappointment with other MDAs after all. Performance requires an efficient implementation, but also and an API that encourages performant use. An example of the effect of such a concern on the mda-rs API, is that the API provides access to the header fields by name, so that one can perform targeted per-header-field string searches, which can be much faster than regular expression searches of the whole header. Finally, an important requirement for all MDAs is that the email delivery is durable, i.e., that no email will be lost in case of a system failure. In other words, at any point in time at least one of the local filesystem and the email server should have a complete and fully accessible copy of the email. While investigating the best way to provide such durability guarantees, I noticed that all three MDAs mentioned previously are fsyncing the written email file, as expected. However, they are not fsyncing the containing directory, which could potentially lead to the file not appearing on the filesystem 1. The exact guarantees in this scenario seem to be dependent on the filesystem, but, to provide maximum durability, mda-rs fsyncs both the file and the containing directory by default. This does incur a performance penalty, so I have also included an option to use the file-sync-only approach if preferred. Since mda-rs was written to cater primarily to my personal needs, it currently has some limitations, or unimplemented functionality, if you will. The most prominent one is that it delivers to maildir only, since that's the only mailbox format I use. The second is that there is no built-in way to change the email data (e.g., add a header field) except by filtering through an external tool, since this is another feature I don't use. Here is a small taste of how a custom MDA would look like with mda-rs: use std::path::PathBuf; use mda::{Email, EmailRegex, Result, DeliveryDurability}; fn main() -> Result<()> { let root = PathBuf::from("/tmp/my-personal-mail"); let mut email = Email::from_stdin_filtered(&["/usr/bin/bogofilter", "-ep"])?; // Use quicker (but possibly less durable) delivery. email.set_delivery_durability(DeliveryDurability::FileSyncOnly); let from = email.header_field("From").unwrap_or(""); let bogosity = email.header_field("X-Bogosity").unwrap_or(""); if bogosity.contains("Spam, tests=bogofilter") || from.contains("@banneddomain.com") { email.deliver_to_maildir(root.join("spam"))?; return Ok(()); } let cc = email.header_field("Cc").unwrap_or(""); let to = email.header_field("To").unwrap_or(""); if to.contains("myworkemail@example.com") || cc.contains("myworkemail@example.com") { if email.body().search("URGENCY RATING: (CRITICAL|URGENT)")? { email.deliver_to_maildir(root.join("inbox/myemail/urgent"))?; } else { email.deliver_to_maildir(root.join("inbox/myemail/normal"))?; } return Ok(()); } email.deliver_to_maildir(root.join("inbox/unsorted"))?; Ok(()) }  and a corresponding minimal Cargo.toml: [package] name = "my-mda" version = "0.1.0" edition = "2018" [dependencies] mda = "0.1"  To provide an idea of the performance gains to expect, I benchmarked a us-ascii only version of my personal mail filtering rules on a sample of 250 of my recently received emails using all the aforementioned MDAs. Of course, the performance results will vary depending on both the rules and the email themselves, but in my experience what is presented below seems to be a typical outcome. I'd be very interested to know how mda-rs worked for others on the performance front. In the benchmark I included runs both with and without filtering for spam, since such filtering takes up a significant amount of processing and affects the relative results. For mda-rs I included two versions, one that mostly uses per-header-field searches, and a second one that uses regular expressions exclusively, and thus is a bit closer to how the other MDAs work. For each mda-rs version I ran the benchmark both in file sync only mode, which is how the others MDAs work, and full sync (file and directory) mode, which is the default for mda-rs. Note that, for this benchmark, both maildrop and mda-rs performed MIME decoding and translation to UTF-8 (internally and automatically), whereas neither procmail nor fdm did so, and no external program was used to provide such functionality. I verified that the delivery results were the same for all MDAs. mda-rs wins hands down when operating in file sync mode, while at the same time doing more work (normalizing the email) compared to the next faster contestant, procmail. In full sync mode, the extra directory syncs prove to be expensive enough to overshadow other mda-rs performance gains, and procmail takes the lead. Still, even in this case, the performance is acceptable and much better compared to both maildrop and fdm. If you would like to learn more about how to use mda-rs, you can find detailed documentation at: https://docs.rs/mda The mda-rs project repository is at: https://github.com/afrantzis/mda-rs The mda-rs crates.io entry is at: https://crates.io/crates/mda Enjoy! ## September 25, 2019 ### Andrew Shadura #### Rust-like enums in Kotlin Rust has an exciting concept of enumeration types, which is much more powerful than enums in other languages. Notably C has the weakest type of enum, since there’s no type checking of any kind, and enum values can be used interchangeably with integers: enum JobState { PENDING, STARTED, FAILED, COMPLETED };  You can opt for manually assigning integers instead of leaving this to the compiler, but that’s about it. Higher level languages like Python and Java treat enumeration types as classes, bringing stricted type checking and better flexibility, since they can be extended nearly as any other classes. In both Python and Java individual enumerated values are singleton instances of the enumeration class. class JobState(Enum): PENDING = auto() STARTED = auto() FAILED = auto() COMPLETED = auto()  enum JobState { PENDING, STARTED, FAILED, COMPLETED; }  Since enumerations are classes, they can define extra methods, but because the enum values are singletons, they can’t be coupled with any extra data, and no new instances of the enum class can be created. In contrast with Python and Java, Rust allows attaching data to enumerations: enum JobState { Pending, Started, Failed(String), Completed }  This allows us to store the error message in the same value as the job state, without having to declare a structure with an extra field which would be used only when the state in Failed. So, what Kotlin has to offer? Kotlin has a language feature called sealed classes. A sealed class is an abstract class with limited interitance: all of its subclasses have to be declated in the same file. In a way, this is quite close to the Rust enums, even though sealed classed look and behave a bit differently. sealed class JobState { object Pending : JobState() object Started : JobState() object Completed : JobState() data class Failed(val errorMessage: String) : JobState() }  Declared this way, JobState can be used in a way similar to Rust’s enums: a single variable of this type can be assigned singletons Pending, Started or Completed, or any instance of Failed with a mandatory String member: val state: JobState = JobState.Failed("I/O error") when (state) { is JobState.Completed -> println("Job completed") is JobState.Failed -> println("Job failed with an error:${state.errorMessage}")
}


This usage resembles the regular Java/Kotlin enums quite a bit, but alternatively, Pending and friends can be declared outside of the sealed class, allowing them to be used directly without the need to add a JobState qualifier.

A slightly simplified real life example from a Kotlin project I’m working on, where a separate coroutine handles I/O with a Bluetooth or a USB device:

sealed class Result
object Connected : Result()
data class Failed(val error: String) : Result()

sealed class CommServiceMsg
data class Connect(val response: CompletableDeferred<Result>) : CommServiceMsg()
object Disconnect : CommServiceMsg()
data class Write(val data: ByteArray) : CommServiceMsg()

fun CoroutineScope.bluetoothServiceActor(device: BluetoothDevice) = actor<CommServiceMsg>(Dispatchers.IO) {
val socket: BluetoothSocket = device.createSocket()

process@ for (msg in channel) {
when (msg) {
is Connect -> {
with(socket) {
msg.response.complete(try {
connect()
Connected
} catch (e: IOException) {
val error = e.message ?: ""
Failed(error)
}
}
}
is Disconnect -> break@process
is Write -> {
socket.outputStream.write(msg.data)
}
}
}
socket.outputStream.flush()
socket.close()
}


Here, we can talk to bluetoothServiceActor using messages each carrying extra data; if the coroutine needs to talk back (in this example, the result of a connection attempt), it uses a CompletableDeferred<> value of the Result type, which can hold an error message when needed.

With that in place, we can write something like this:

val bluetoothService = bluetoothServiceActor(device)
val response = CompletableDeferred<Result>()

bluetoothService.send(Connect(response))
var result = response.await()
when (result) {
is Connected -> {
bluetoothService.send(Write(byteArrayOf(42, 0x1e, 0x17)))
bluetoothService.send(Disconnect)
}
is Failed ->

## April 01, 2019

### Alyssa Rosenzweig

#### Kodi and SuperTuxKart on Panfrost

Back in October, Panfrost ran some simple benchmarks, like glmark. Five months later, Panfrost has grown from running benchmarks to real-world apps, like Kodi, and 3D games like SuperTuxKart and Neverball.

Since the previous post, there have been major improvements across every part of the aspect culminating in this milestone. On the kernel side, my co-contributors Tomeu Vizoso and Rob Herring have created a modern kernel driver, suitable for mainline inclusion. Panfrost now uses this upstream-friendly driver, rather than relying on a modified legacy kernel as in the past. The new kernel module is currently under review for mainline inclusion. You can read more about this progress on Tomeu’s blog.

Outside the kernel, however, the changes have been no less significant. Early development was constrained to our own project repositories, as the code was not yet ready to general users. In early February, thanks in part to the progress on the kernel-space, we flew out of our nest, and Panfrost was merged into upstream Mesa, the central repository for free software graphics drivers. Development now occurs in-tree in Mesa.

We have continued decoding new aspects of the hardware and implementing support in the driver. A few miscellaneous additions include cube maps, gl_PointSize and gl_PointCoord, linear depth rendering, performance counters, and new shader instructions.

One area of particular improvement has been our understanding of the hardware formats (like “4-element vector of 32-bit floats” or “single 16-bit unsigned normalized integer”). In Panfrost’s early days, we knew magic numbers to distinguish a few of the most common formats, but the underlying meanings of the voodoo patterns were elusive. Further, the format bits for textures and attributes were not unified, further hindering the diversity of supported formats available. However, Connor Abbott and I have since identified the underlying meaning of the format codes for textures, attributes, and framebuffers. This new understanding allows for the magic numbers to be replaced by a streamlined format selection routine, mapping Gallium’s formats to the hardware’s and supporting the full spectrum of formats required for a conformant driver. Panfrost is now passing texture format tests for OpenGL ES 2.0.

From a performance standpoint, various optimizations have been added. In particular, a fast path likely relating to the “tiler” in the hardware was discovered. When this fast path is used, performance on geometry heavy scenes skyrockets. In one extreme demo (shading the Stanford bunny), performance more than tripled, and these gains trickle down to real-world games.

Features aside, one of the key issues with an early driver is the brittleness and instability. Accordingly, to guarantee robustness, I now test with the drawElements Quality Program (dEQP), which includes comprehensive code correctness tests. Although we’re still a while away from conformance, I now systematically step through identified issues and resolved the bugs, translating to fixes across every aspect of the driver.

One real-world benefactor of these fixes is the Kodi media center, which today works well using Panfrost to achieve a fluid interface on Midgard devices. For standalone installations of Kodi, today there are experimental images featuring Kodi and Panfrost. To further improve fluidity, Kodi and Panfrost can even interoperate with the video decoding acceleration, contingent on cooperative kernel drivers.

For users more inclined to gaming, some 3D games are beginning to show signs of life with Panfrost. For instance, the classic (OpenGL ES 2.0) backend of the ever-popular kart racing game, SuperTuxKart, now renders with some minor glitches with Panfrost. Performance is playable on simple tracks, though we have many opportunities for optimization. To bring up this racing game, I added support for complex control flow in the compiler. Traditionally, control flow is discouraged in graphics, due to the architecture of desktop GPUs (thread “warps”). However, Midgard does not feature these traditional optimizations, negating the performance penalty for branching from control flow. The implementation required new bookkeeping in the compiler, as well as an investigation into long jumps due to the size of the game’s “uber-shader”. In total, this compiler improvement – paired with assorted bug fixes – allows SuperTuxKart to run.

Likewise, Neverball is playable (and fun!) with Panfrost, although there are rendering anomalies relating to the currently unimplemented legacy feature “point sprites”. In contrast to Kodi and SuperTuxKart, which make liberal use of custom shaders, Neverball is implemented with purely fixed-function desktop OpenGL. This poses an interesting challenge, as Midgard is designed specifically for embedded graphics; the blob does not support this desktop superset. But that’s no reason we can’t!

Like most modern free software OpenGL drivers, Panfrost is built atop the modular “Gallium” architecture. This architecture abstracts away interface details, like desktop versus embedded OpenGL, normalizing differences to allow drivers to focus on the hardware itself. This abstraction means that by implementing Panfrost as an embedded driver atop Gallium, we get a partial desktop OpenGL implementation “free”.

Of course, there is functionality in the desktop superset that does not exist in the embedded profile. While Gallium tries to paper over these differences, the driver is required to implement features like point sprites and alpha testing to expose the corresponding desktop functions. So, the bring-up of desktop OpenGL applications like Neverball has led me to implement some of this additional functionality. Translating the “alpha test” to a conditional discard instruction in the fragment shader works. Similarly, translating “point sprites” to the modern equivalent, gl_PointCoord, is planned.

Interestingly, the hardware does support some functionality only available through the full desktop profile. It is unknown how many “hidden features” of this type are supported; as the blob does not appear to use them, these features were discovered purely by accident on our part. For instance, in addition to the familiar set of “points, lines, and triangles”, Midgard can natively render quadrilaterals and polygons. The existence of this feature will suggested by the corresponding performance counters, and the driver-side mechanics were determined by manual bruteforce of the primitive selection bits. Nevertheless, now that these bonus features are understood, quads can be drawn from desktop applications without first translating to indexed triangles in software. Similarly, it appears in addition to the embedded standard of boolean occlusion queries, setting a chicken bit enable the hardware’s hidden support for precise occlusion counters, a useful desktop feature.

Going forward, although the implementation of OpenGL ES 2.0 is approaching feature-completeness, we will continue to polish the driver, guided by dEQP. Orthogonal to conformance, further optimization to improve performance and lower memory usage is on the roadmap.

It’s incredible to reflect back and realise just one year ago, work had not even begun on writing a real OpenGL driver. Yet here we are today with an increasingly usable, exclusive free software, hardware-accelerated desktop with Mali Midgard graphics.

Frost on.