Planet Collabora

May 29, 2020

Sjoerd Simons

Hacker Cat


by Sjoerd Simons at May 29, 2020 12:52 PM

First post!


by Sjoerd Simons at May 29, 2020 12:52 PM

April 23, 2020

Alyssa Rosenzweig

From Bifrost to Panfrost - deep dive into the first render

glmark2-es2 -btexture

In Panfrost’s infancy, community members Connor Abbott and Lyude Paul largely reverse-engineered Bifrost and built a proof-of-concept shader dis/assembler. Meanwhile, I focused on the Midgard architecture (Mali T600+), building an OpenGL driver alongside developers like Collaboran Tomeu Vizoso.

As Midgard support has grown – including initial GLES3 support – we have now turned our attention to building a Bifrost driver. We at Collabora got to work in late February, with Tomeu porting the Panfrost command stream, while I built up a new Bifrost compiler.

This week, we’ve reached our first major milestone: the first 3D renders on Bifrost, including basic texture support!

glmark2-es2 -bshading:shading=phong:model=horse
glmark2-es2 -bbuffer

The interface to a modern GPU has two components, the fixed-function command stream and the programmable instruction set architecture. The command stream controls the hardware, dispatching shaders and containing the state required by OpenGL or Vulkan. By contrast, the instruction set encodes the shaders themselves, as with any programmable architecture. Thus the GPU driver contains two major components, generating the command stream and compiling programs respectively.

From Midgard to Bifrost, there have been few changes to the command stream. After all, both architectures feature approximately the same OpenGL and Vulkan capabilities, and the fixed-function hardware has not required much driver-visible optimization. The largest changes involve the interfaces between the shaders and the command stream, including the titular shader descriptors. Indeed, at a squint, command stream traces from Midgard and Bifrost look similar – but the long tail of minor updates implies a nontrivial Panfrost port.

But the Bifrost instruction set, on the other hand? A rather different story.

Let’s dive in.

Compiler

Bifrost’s instruction set was redesigned completely from Midgard’s, requiring us to build a free software compiler targeting Bifrost from scratch. Midgard’s architecture is characterized as:

  • Vector. 128-bit arithmetic logic unit (ALU) allows 32-bit 4-channel SIMD.

  • Very Long Instruction Word. 5 blocks – scalar/vector add/multiply and special function unit – operate with partial parallelism across 2 pipeline stages.

Vector and VLIW architectures promise high-performance in theory, and in theory, theory and practice are the same. In practice, these architectures are extremely difficult to compile efficiently for. Automatic vectorization is difficult; if a shader uses a 3-channel 32-bit vector (vec3), most likely the extra channel will go unused, wasting resources. Likewise, scheduling for VLIW architectures is difficult and slots often go unused, again wasting resources and preventing shaders from reaching peak architectural performance.

Bifrost by contrast is:

  • Scalarish. 32-bit ALU. 32-bit operations are purely scalar; 16-bit operations are 2-channel SIMD and 8-bit operations are 4-channel SIMD. 64-bit operations require a special half-performance mode.

  • Slightly Long Instruction Word. Bifrost has 2 blocks – fused multiply-add and add – pipelined without parallelism. Simplified special function unit.

This redesign promises better performance - and a redesign of Panfrost’s compiler, too.

Bifrost Intermediate Representation

At the heart of a modern compiler is one or more Intermediate Representations (IRs). In Mesa, OpenGL shaders are parsed into GLSL IR – a tree IR expressing language-visible types – which is converted to NIR – a flat static single assignment IR enabling powerful optimizations. Backends convert NIR to a custom backend IR, whose design seals the fate of a compiler. A poor IR design can impede the growth and harm the performance of the entire compiler, while a well-designed IR enables complex features to be implemented naturally with simple, fast code. There is no one-size-fits-all IR; the design is necessarily built to make certain compiler passes (like algebraic optimization) simple at the expense of others (like register allocation), justifying the use of multiple IRs within a compiler. Further, IR design permeates every aspect of the compiler, so IR changes to a mature compiler are difficult. Both Intel and AMD GPU compilers have required ground-up rewrites to fix IR issues, so I was eager to get the Bifrost IR (BIR) right from the start.

An IR is simply a set of data structures representing a program. Typically backends use a “flat” IR: a list of blocks, where each block contains a list of instructions. But how do we store an instruction?

We could reuse the hardware’s instruction format as-is, with abstract variables instead of registers. It’s tempting, and for simple architectures, it can work. The initial NIR-to-BIR pass is a bit harder than with abstract instructions, but final code generation is trivial, since the packing is already done.

Unfortunately, real world instructions sets are irregular and as quirky as their compilers’ developers. Further, they are tightly packed to be small instead of simple. For final assembly, we will always need to pack the machine formats, but with this design, we also need to unpack them. Worse, the machine irregularities will permeate every aspect of the compiler since they are now embedded into the IR. On Bifrost, for example, most operations have multiple unrelated encodings; this design would force much of the compiler to be duplicated.

So what if the IR were entirely machine-independent, compiling in the abstract and converting to the machine-specific form only at the very end? Such IRs are helpful; in Mesa, the machine-independent NIR facilitates sharing of powerful optimizations across drivers. Unfortunately, some machine-specific design traits really do affect our compiler. Is there a compromise?

Notice the first IR simplifies packing at the expense of the rest of the compiler, whereas the second simplifies NIR-to-BIR at the expense of machine-specific passes. All designs trade complexity in one part of the compiler for another. Hence a good IR simplifies the hardest compiler passes at the expense of the easiest. For us, scheduling and register allocation are NP-complete problems requiring complex heuristics, whereas NIR-to-BIR and packing are linear translations with straightforward rules. Ideally, our IR simplifies scheduling and register allocation, pushing the complexity into NIR-to-BIR and packing. For Bifrost, this yields one golden rule:

A single BIR instruction corresponds to a single Bifrost instruction.

While single instructions may move during scheduling and be rewired by register allocation, the operation itself is unaffected. On the other hand, within an instruction:

BIR instructions are purely abstract without packing.

By delaying packing, we simplify manipulation. So NIR-to-BIR is complicated by the one-to-many mapping of NIR to Bifrost opcodes with special functions; meanwhile, final code generation is complicated by the pattern matching required to infer packing from abstract instructions. But by pushing this complexity to the edges, the optimizations in between are greatly simplified.
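To make “purely abstract” concrete, here is a rough sketch of what an unpacked backend IR instruction could look like; the field names are hypothetical and heavily simplified, not Panfrost’s actual structures:

/* Hypothetical, simplified sketch of an unpacked backend IR instruction:
 * opcode, operands and types live in plain fields, so optimization passes
 * never have to decode or re-encode machine formats. */
struct bir_instruction {
        struct bir_instruction *next;   /* linked into its basic block */
        unsigned op;                    /* abstract opcode, e.g. BIR_OP_FADD */
        unsigned dest;                  /* SSA index, or a register after RA */
        unsigned src[3];                /* operand indices */
        unsigned dest_type;             /* size and base type, e.g. 32-bit float */
        unsigned src_types[3];          /* per-source types, needed for 16-bit support */
};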

But speaking of IR mistakes, there is one other common issue…

16-bit Support

One oversight I made in designing the Midgard IR – an oversight shared by the IRs of many other compilers in Mesa – was assuming that instructions operate on 32-bit data only. In OpenGL with older Mesa versions, this assumption held, as the default float and integer types are 32-bit. However, the assumption is problematic for OpenCL, where 8-, 16-, and 64-bit types are first class. Even for OpenGL, it is suboptimal: while the specification mandates minimum precision requirements that are fulfillable with 32-bit arithmetic, shaders using mediump precision qualifiers may be run with 16-bit arithmetic instead. About a month ago, Mesa landed support for optimizing mediump fragment shaders to use 16-bit arithmetic, so for Bifrost, we want to make sure we can take advantage of these optimizations.

The benefit of reduced precision is two-fold. First, shader computations need to be stored in registers, but space in the register file is scarce, so we need to conserve it. If a shader runs out of registers, it spills to main memory, which is slow, so by using 16-bit types instead of 32-bit, we can reduce spilling. Second, although Bifrost is scalar for 32-bit, it is 2-channel vector for 16-bit. As mentioned, automatic vectorization is difficult, but if shaders are vectorized to begin with, the compiler can take advantage of this.

As an example, to add 32-bit vector R0-R3 with R4-R7, we need code like:

ADD.f32 R0, R0, R4
ADD.f32 R1, R1, R5
ADD.f32 R2, R2, R6
ADD.f32 R3, R3, R7

But in 16-bit mode with vectors R0-R1 and R2-R3:

ADD.v2f16 R0, R0, R2
ADD.v2f16 R1, R1, R3

Notice both register usage and instruction count are halved. How do we support this in Mesa? Mesa pipes 16-bitness through NIR into our backend, so we must ensure types are respected across our backend compiler. To do so, we include types explicitly in our backend intermediate representation, which the NIR-to-BIR pass simply passes through from NIR. Certain backend optimizations have to be careful to respect these types, whereas others work with little change provided the IR is well-designed. Scheduling is mostly unaffected. But where are there major changes?

Enter register allocation.

Register allocation

A fundamental problem every backend compiler faces is register allocation, mapping abstract IR variables to concrete machine registers:

0 = load uniform #0
1 = load attribute #0
2 = add 0, 1
3 = mul 2, 2

\(\rightarrow\)

R0 = load uniform #0
R1 = load attribute #0
R0 = add R0, R1
R0 = mul R0, R0

Traditionally, register allocation is modeled by a graph, where each IR variable is represented by a node. Any variables that are simultaneously live, in the sense that both of their values will be read later in the program, are said to interfere since they require separate registers to avoid clobbering each other’s data. Interference is represented by edges. Then the problem reduces to graph colouring, finding colours (registers) for each node (variable) that don’t coincide with the colours (registers) of any nodes connected by edges (interfering variables). Initially, Panfrost used a graph colouring approach.
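As a toy illustration of the graph colouring approach (not Panfrost’s actual allocator), a greedy colouring over a small interference matrix could look like the following; for the four-variable program above, only variables 0 and 1 interfere, so it yields colours {0, 1, 0, 0}, i.e. R0, R1, R0, R0:

/* Toy greedy graph colouring: interfere[i][j] is non-zero when variables
 * i and j are simultaneously live; colour[i] receives the register chosen
 * for variable i. Illustrative only. */
#define N_VARS 4

static void colour_graph(int interfere[N_VARS][N_VARS], int colour[N_VARS])
{
        for (int i = 0; i < N_VARS; i++) {
                unsigned used = 0;                 /* colours taken by already-coloured neighbours */
                for (int j = 0; j < i; j++)
                        if (interfere[i][j])
                                used |= 1u << colour[j];
                colour[i] = 0;
                while (used & (1u << colour[i]))   /* pick the lowest free colour */
                        colour[i]++;
        }
}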

However, the algorithm above is difficult to adapt to irregular vector architectures. One alternative approach is to model the register file explicitly, modeling interference as constraints and using a constraint solver to allocate registers. For the above program, letting \(k_i\) denote the register allocated to variable \(i\), there is a single constraint on the registers, \(k_0 \ne k_1\), since \((0, 1)\) is the only pair of interfering nodes, yielding an optimal valid solution \(k_0 = 0, k_1 = 1, k_2 = 0, k_3 = 0\), corresponding to the allocation above.

As-is, this approach is mathematically equivalent to graph colouring. However, unlike graph colouring, it extends easily to model vectorization, enabling per-component liveness tracking, so some components of a vector are allocated to a register while others are reused for another value. It also extends easily to vectors of varying type sizes, crucial for 16-bit support, whereas the corresponding generalization for graph colouring is much more complicated.

This work was originally conducted in October for the Midgard compiler, but the approach works just as well for Bifrost. Although Bifrost is conceptually “almost” scalar, there are enough corner cases where we dip into vector structures that a vector-aware register allocator is useful. In particular, 16-bit instructions involve subdividing 32-bit registers, and vectorized input/output requires contiguous registers; both are easily modeled with linear-style constraints.

Packing

The final stage of a backend compiler is packing (final code generation), taking the scheduled, register allocated IR and assembling a final binary. Compared to Midgard, packing for Bifrost is incredibly complicated. Why?

Vectorized programs for vector architectures can be smaller than equivalent programs for scalar architectures. The instruction sequence above, adding a 4-channel vector, corresponds to just a single instruction on Midgard:

VADD.FADD R0.xyzw, R0.xyzw, R1.xyzw

We would like to minimize program size, since accesses to main memory are slow and increasing the size of the instruction cache is expensive. By minimizing size, smaller caches can be used with better efficiency. Unfortunately, naively scalarizing the architecture by a factor of 4 would appear to inflate program size by 4x, requiring a 4x larger cache for the same performance.

We can do better than simply duplicating instructions. First, by simplifying the vector semantics (since we know most code will be scalar or small contiguous vectors), we eliminate vector write masks and swizzles. But this may not be good enough.

Bifrost goes further to save instruction bits in any way possible, since small savings in instruction encoding accumulate quickly in complex shaders. For example, Bifrost separates medium-latency register file accesses from low-latency “port” accesses, loading registers into ports ahead of the instruction. There are 64 registers (32 bits each), requiring 6 bits to index a register number. The structure to specify which registers should be loaded looks like:

unsigned port_0 : 5;
unsigned port_1 : 6;

We have 6 bits to encode port 1, but only 5 bits for port 0. Does that mean port 0 can only load registers R0-R31, instead of the full range?

Actually, no - if port 0 is smaller than port 1, the port numbers are as you would expect. But Bifrost has a trick: if port 0 is larger than port 1, then the hardware subtracts each port from 63 to get the actual port number. In effect, the ordering of the ports is used as an implicit extra bit, storing 12 bits of port data in 11 bits on the wire. (What if the port numbers are equal? Then just reuse the same port!)
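To make the rule concrete, here is a hypothetical sketch of how a decoder could recover the two register numbers from the raw fields, following the description above (illustrative only, not the actual Panfrost disassembler code):

/* If the raw port 0 field is less than or equal to port 1, both values are
 * literal register numbers (equal means the same register is simply used
 * twice); otherwise both are subtracted from 63 to recover the real numbers. */
struct bifrost_ports {
        unsigned port_0 : 5;
        unsigned port_1 : 6;
};

static unsigned decode_port_0(struct bifrost_ports p)
{
        return p.port_0 <= p.port_1 ? p.port_0 : 63 - p.port_0;
}

static unsigned decode_port_1(struct bifrost_ports p)
{
        return p.port_0 <= p.port_1 ? p.port_1 : 63 - p.port_1;
}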

Similar tricks permeate the architecture, a win for code size but a loss for compiler packing complexity. The good news is that our compiler’s packing code is self-contained and unit tested independent of the rest of the compiler.

Conclusion

Putting it all together, we have the beginnings of a Bifrost compiler, sufficient for the screenshots above. Next will be adding support for more complex instructions and scheduling to support more complex shaders.

Architecturally, Bifrost turns Midgard on its head, ideally bringing performance improvements but rather complicating the driver. While Bifrost support in Panfrost is still early, it’s progressing rapidly. The compiler code required for the screenshots above is all upstreamed, and the command stream code is working its way up. So stay tuned, and happy hacking.

Originally posted on Collabora’s blog

April 23, 2020 04:00 AM

March 24, 2020

Erik Faye-Lund

Introducing OpenCL™ and OpenGL® on DirectX

For the last few months, we have been working on two exciting new projects at Collabora, and it’s finally time to share some information about them with the world:

We are partnering with Microsoft DirectX engineers to build OpenCL and OpenGL mapping layers, in order to bring OpenCL 1.2 and OpenGL 3.3 support to all Windows and DirectX 12 enabled devices out there!

This work builds on a lot of previous work. First and foremost, we are building this by using Mesa 3D, with the Gallium interface as the base for the OpenGL layer, and NIR as the base for the OpenCL compiler. We are also using LLVM and the SPIRV-LLVM-Translator from Khronos as the compiler front-end.

In addition, we are taking advantage of Microsoft’s experience in creating their D3D12 Translation Layer, as well as our own experience from developing Zink.

What is Mesa 3D?

Mesa 3D is an open source implementation of several graphics technologies, including OpenCL and OpenGL. The OpenGL implementation in Mesa is robust and is used as the base for several industry-strength OpenGL drivers from multiple GPU vendors.

Among other things, Mesa consists of several API implementations (called state-trackers) as well as the Gallium low-level driver interface. The Gallium interface hides a lot of the legacy OpenGL details and translates OpenGL calls into something that looks more like modern GPU primitives.

Why translate APIs?

Not all Windows-powered devices have consistent support for hardware-accelerated OpenCL and OpenGL. So in order to improve application compatibility, we are building a generic solution to the problem. This means that a GPU vendor only has to implement a D3D12 driver for their hardware in order to support all three APIs.

This mapping layer is also expected to serve as a starting point in porting older OpenCL and OpenGL applications over to D3D12.

In addition, we believe this is good for the wider open source community. A lot of the problems we are solving here are shared with other drivers and translation layers, and we hope that the code will be useful beyond the use cases listed above.

Implementation

The work is largely split into three parts: an OpenCL compiler, an OpenCL runtime, and a Gallium driver that builds and executes command-buffers on the GPU using the D3D12 API.

In addition, there is a shared NIR-to-DXIL shader compiler that both components use. For those not familiar with NIR, it is Mesa’s internal representation for GPU shaders. Similarly, DXIL is Microsoft’s internal representation, which D3D12 drivers will consume and translate into hardware-specific shaders.

OpenCL compiler

The OpenCL compiler uses LLVM and the SPIRV-LLVM-Translator to generate SPIR-V representations of OpenCL kernels. These, in turn, are passed to Mesa’s SPIR-V to NIR translator, where some optimizations and semantic translations are done. Then the NIR representation is finally passed to NIR-to-DXIL, which produces a DXIL compute shader and the needed metadata so it can be executed on the GPU by the runtime using D3D12.

Here’s a diagram of the complete process, including NIR-to-DXIL, which will be described below:

OpenCL Compiler Overview

OpenCL runtime

While Mesa provides an OpenCL implementation called Clover, we are not using it for this project. Instead, we have a new OpenCL runtime that does a more direct translation to the DirectX 12 API.

NIR-to-DXIL

DXIL is essentially LLVM 3.7 bitcode with some extra metadata and validation. This was a technical choice that made sense for Microsoft because all the major driver vendors already used LLVM in their compiler toolchain. Using an older version of the LLVM bitcode format gives good compatibility with drivers because the LLVM bitcode format is backwards compatible.

Because we depend on a much more recent version of LLVM for the compiler front-end, we sadly cannot easily use the DirectX Shader Compiler as a compiler back-end. The DirectX Shader Compiler is effectively a fork of LLVM 3.7, while we are currently using LLVM 10.0 for the compiler front-end. Using the DirectX Shader Compiler would require us to link two different versions of LLVM into the same binary, which would have led to problems.

We also cannot easily use LLVM itself to generate the bitcode. While the LLVM bitcode format is backwards compatible, LLVM itself is not forward compatible. This means that newer versions of LLVM cannot produce a bitcode format that is understood by older versions. This makes sense from LLVM’s point of view because it was never meant as a general interchange format.

So instead, we have decided to implement our own DXIL emitter. This is quite a bit harder than it looks because LLVM bitcode goes to great lengths to try to make the format as dense as possible. For instance, LLVM does not store its bitcode as a sequence of bytes and words, but rather as variable-width bitfields in a long sequence of bits.
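To give a flavour of what that means in practice, here is a minimal, hypothetical sketch of appending variable-width bitfields to a byte stream; the real NIR-to-DXIL emitter is of course far more involved:

#include <stdint.h>
#include <stddef.h>

/* Append 'width' bits of 'value' to the stream, least significant bit first.
 * Assumes the buffer is zero-initialised and large enough. Illustrative only. */
struct bitstream {
        uint8_t *data;
        size_t   byte;  /* index of the byte currently being filled */
        unsigned bit;   /* next free bit within that byte, 0..7 */
};

static void emit_bits(struct bitstream *s, uint64_t value, unsigned width)
{
        for (unsigned i = 0; i < width; i++) {
                s->data[s->byte] |= ((value >> i) & 1) << s->bit;
                if (++s->bit == 8) {
                        s->bit = 0;
                        s->byte++;
                }
        }
}

/* e.g. emit_bits(&s, abbrev_id, 6); emit_bits(&s, operand, 14); */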

There are a lot of tricky details to get right, but in the end we have a compiler that works.

D3D12 Gallium driver

The D3D12 Gallium driver is the last piece of the puzzle. Essentially, it takes OpenGL commands and, with the help of the NIR to DXIL translator, turns them into D3D12 command-buffers, which it executes on the GPU using the D3D12 driver.

There are a lot of interesting details that make this tricky as well, but I will save those details for later.

But to not leave you empty-handed, here’s a screenshot of the Windows version of the famous glxgears, wglgears:

wglgears on DirectX12

Source code

In the short term, the source code can be found here. We intend to upstream this work into the main Mesa repository shortly, so this is not its permanent home.

Next steps

This is just the announcement, and a whole lot of work is left to be done. We have something that works in some cases right now, but we are just starting to scratch the surface.

First of all, we need to get up to the feature-level that we target. Our goal at the moment is to pass conformance tests for OpenCL 1.2 and OpenGL 3.3. We have a long way to go, but with some hard work and sweat, I am sure we will get there.

Secondly, we need to work on application compatibility. For now we will be focusing on productivity applications.

We also want to upstream this in Mesa. This way we can keep up with fixes and new features in Mesa, and other drivers can benefit from what we are doing as well.

Acknowledgments

It is also important to point out that I am not the only one working on this. Our team consists of five additional Collabora engineers (Boris Brezillon, Daniel Stone, Elie Tournier, Gert Wollny, Louis-Francis Ratté-Boulianne) and two Microsoft DirectX engineers (Bill Kristiansen, Jesse Natalie).

by Erik Faye-Lund at March 24, 2020 12:00 AM

March 09, 2020

Alexandros Frantzis

Interactive time series graph of confirmed COVID-19 cases per country

I have been looking for more information about the trends of the number of confirmed COVID-19 cases per country. I could find the numbers for each day, but I wanted a visual representation of the information in order to readily compare the trends in various regions.

Johns Hopkins University (JHU) has set up a great dashboard tracking the COVID-19 cases worldwide. It contains a time series graph of the overall cases, but doesn't provide such graphs for each country. Thankfully, JHU have made all their data publicly available (for educational/academic purposes) in their JHU CSSE COVID-19 repository.

Since I couldn't find a visualization that matched my needs I decided to create my own interactive graph:

Interactive time series graph of confirmed COVID-19 cases per country

The graph is implemented as a web page that uses client-side processing (i.e., JavaScript) to plot the time series for one or more countries. The JHU data is fetched and processed by the client on each page load, so the graph will display the latest data without requiring any server-side updates.

For example, this is a plot from that page for France, Italy and Spain as of 2020-03-08:

france-italy-spain

Given the time offset between the trend lines, it's not always clear how the trends compare. To be able to compare trends more easily, I added an option to synchronize the displayed time series so that they start at (approximately) the same confirmed case count. Here is the same series as above but synchronized at 200 cases:

france-italy-spain-sync-200

This plot indicates that France and Italy are following roughly the same trend, whereas Spain is faring a bit better.
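For reference, the synchronization option essentially boils down to finding, for each country, the first day on which its series reaches the chosen case count, and shifting the series to start there. Here is a minimal sketch of that lookup, written in C purely for illustration (the page itself does this in client-side JavaScript):

/* Return the index of the first day on which the series reaches the
 * requested case count, or -1 if it never does; the displayed series are
 * then shifted so that each country starts at its own index. */
static int sync_index(const int *cases, int n_days, int threshold)
{
        for (int i = 0; i < n_days; i++)
                if (cases[i] >= threshold)
                        return i;
        return -1;
}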

Before you draw any conclusions about trends that show up after synchronization, keep in mind two important caveats.

First, since our sampling granularity is one day, and jumps in confirmed cases between successive days can be large, starting points for each series after synchronization may in fact be different enough to skew results. For example, we may want to synchronize at around 200, but if one series jumps from 50 to 300, the starting point for that series is not going to be near 200.

Second, using different synchronization points may provide different views of the trends. The same France-Italy-Spain plot synchronized at 40 cases gives:

france-italy-spain-sync-40

In this plot we can see that France and Spain initially followed a similar trend, but then France started diverging upwards. This is somewhat apparent in the original, non-synchronized plot, but only because the France and Spain trend lines have a small time offset.

I hope people find this interactive graph informative!

by Alexandros Frantzis at March 09, 2020 12:00 AM

February 21, 2020

Andrew Shadura

Follow-up on the train journey to FOSDEM

Here’s a recap of my train journey based on the Twitter thread I kept posting as I travelled.

To FOSDEM…

The departure from Bratislava was as planned:

Ready to depart from Bratislava hl. st.

Half an hour in Vienna was just enough for me to grab some coffee and breakfast and board the train to Frankfurt without a hurry:

Boarding a Deutsche Bahn ICE to Frankfurt am Main

Unfortunately, soon after we left Linz and headed to Passau, the train broke down. Apparently, it powered down and the driver was struggling to reboot it. After more than an hour at Haiding, we finally departed with a huge delay:

Trapped in Haiding near Linz

Since the 18:29 train to Brussels I needed to catch in Frankfurt was the last one that day, I was put into a hotel Leonardo across the street from Frankfurt Hbf, paid by Deutsche Bahn, of course. By the time of our arrival in Frankfurt, the delay was 88 minutes.

Hotel room in Frankfurt am Main

Luckily, I didn’t have to convince Deutsche Bahn to let me sleep in the morning, they happily booked me (for free) onto a 10:29 ICE to Brussels so I had an opportunity to have a proper breakfast at the hotel and spend some time at Coffee Fellows at the station.

Guten Morgen Frankfurt
About to depart for Brussels

Fun fact: Aachen is called Cáchy in Czech, apparently as a corruption of an older German form ze Aachen.

Stopping at Aachen

Having met some Debian people on the train, I finally arrived in Brussels, albeit with some delay. This, unfortunately, meant that I didn’t go to Vilvoorde to see a friend, so the regional tickets I bought online were useless.

Finally, Brussels!

… and back!

The trip home was much better in terms of missed trains, if only a tiny bit more tiring, since I took it in one day.

Leaving Brussels on time

On the way to Frankfurt, I spent most of the time in the bistro carriage. Unfortunately, the espresso machine was broken and they didn’t have any croissants, but the tea with milk was good enough.

In the bistro carriage

I used the fifty minutes I had in Frankfurt to claim compensation for the delay; the €33 arrived in my bank account the next week.

The ICE train to Wien Hbf is about to depart
Herzlich willkommen in Österreich!

Arrived at Wien Hbf
The last leg

Finally, exactly twelve hours and one minute after the departure, almost home:

Finally home

by Andrej Shadura at February 21, 2020 02:09 PM

January 31, 2020

Andrew Shadura

Using gcc sanitisers to get a nasty bug fixed

A couple of days ago a colleague at Collabora asked me to help create a Debian package for the tool he needed to complete his task. The tool happened to be an NXP code signing tool, used to sign OS images to be run on i.MX systems using ‘High Assurance Boot’.

As often happens, the tool was distributed in a manner typical of many big corporations: no direct link to the tarball, a custom buildsystem, and compiled binaries in the same tarball as the sources. A big relief was that the tool has been distributed under a three-clause BSD license since version 3.3.0 (the sources were not provided at all before that).

The custom buildsystem meant it was not using the Debian build flags (CFLAGS := …, CC := gcc), so I had to plug a custom ‘toolchain’ Makefile into the right place with definitions relevant to Debian:

DPKG_EXPORT_BUILDTOOLS = 1
include /usr/share/dpkg/buildtools.mk

LD = $(CC)
COPTIONS += -std=c99 -D_POSIX_C_SOURCE=200809L -fPIC $(shell dpkg-buildflags --get CPPFLAGS) $(shell dpkg-buildflags --get CFLAGS)
LDOPTIONS += $(shell dpkg-buildflags --get LDFLAGS)

After some time playing with the build system, the package was done and sent to a colleague pending an upload to Debian. What did worry me a bit was a weird warning the compiler gave me every time:

../../code/front_end/src/acst.c: In function ‘encrypt_images’:
../../code/front_end/src/acst.c:791:25: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
  791 |         uint8_t *hash = malloc(hash_size);
      |                         ^~~~~~~~~~~~~~~~~

We didn’t use encryption, so I didn’t feel motivated to investigate.

Bad move.

Next day, the colleague comes back to me with this:

I did test builds in Apertis SDK, Debian buster, Fedora 31, ALT Linux 9 -- only the latter environment produced the cst tool which generated correct signature for the file.

The rest produce a bit different signature with zeroes for the current kernel:

000007E9 00 30
000007EA 00 82
000007EB 00 01
000007EC 00 F9
00000D41 00 30
00000D42 00 82
00000D43 00 01
00000D44 00 F9

I found that the problem is related to compilation of code/front_end/src/cst.c.

The colleague sent me a tarball with the working binary and a simple test case.

First attempt

Okay, let’s see. What are those eight different bytes? The signing tool comes with a utility, csf_parser, to look inside the signed file and extract bits and pieces. With debug logs on, for the correctly signed file the logs say this:

CSF TAG: 0xD8
Parsing following values
0xD8 0x02 0x01 0x40 0x30 0x82 0x01 0xF9 0x06 …

The incorrectly signed one has this:

CSF TAG: 0xD8
Parsing following values
0xD8 0x02 0x01 0x40 0x00 0x00 0x00 0x00 0x06 …

What’s 0xD8? hab_types.h says:

#define HAB_TAG_SIG  0xd8         /**< Signature */

Apparently, the first four bytes are a header of the data chunk, so the differing bytes are part of the signature itself.

Let’s add some debug prints to see what’s going on. Oh no: now the difference is not just 8 bytes, but much more. Looks like there’s some memory corruption or undefined behaviour or maybe both.

Meanwhile, the colleague chimes in: he tried linking cst.o from ALT Linux with the rest of the program built on Debian, and produced a working binary.

The dead end: disassembly

My first thought was: oh, it’s just some code relying on undefined behaviour, what if I decompiled both object files and compared them?

In the past, when I needed to disassemble something I used IDA, but since it’s proprietary, I never had my own license for it, and I no longer have access to it anyway, it was not an option. It did, however, have a very good decompilation module which produced mostly readable C-like pseudocode.

When a couple of years ago I looked for a free software replacement for IDA, radare seemed like it had some potential, but it didn’t seem ready, and it wasn’t fully packaged for Debian. Things have changed: not only is radare in Debian, it also comes with a GUI, Cutter, which is very good, even though there’s always room for improvement.

To my surprise, it turned out radare also has some decompilation functionality which worked really well.

Anyway, I won’t go further into it since this experiment didn’t give me any useful data to work with.

Trying it with gdb

Good, so the code at fault is somewhere close to the signing functions, so let’s try and debug it.

I rarely use the command-line gdb for anything other than printing backtraces, so here’s a small cheatsheet for the next time I need it:

  • break -func create_sig_file: set breakpoint at a function
  • finish: ‘step out’, run to the end of the current function and stop right after it returns
  • x/10xb pointer: print 10 hexadecimal bytes

This is what I figured out during my debugging session: the data gets corrupted at the end of the signing sequence, inside or before the cms_to_buf function invocation, which copies the signature data from the internal OpenSSL data structure into the output buffer:

(gdb) finish
Run till exit from #0  cms_to_buf (cms=cms@entry=0x555555598a60, bio_in=bio_in@entry=0x555555597ff0, 
    data_buffer=data_buffer@entry=0x7fffffffd380 "\003", data_buffer_size=data_buffer_size@entry=0x7fffffffd37c, 
    flags=flags@entry=17090) at ../../code/back_end/src/adapt_layer_openssl.c:364
0x0000555555564399 in gen_sig_data_cms (sig_buf_bytes=0x7fffffffd37c, sig_buf=0x7fffffffd380 "", 
    hash_alg=<optimised out>, key_file=0x555555578f70 "IMG1_1_sha256_2048_65537_v3_usr_key.pem", 
    cert_file=0x555555591440 "IMG1_1_sha256_2048_65537_v3_usr_crt.pem", in_file=<optimised out>)
    at ../../code/back_end/src/adapt_layer_openssl.c:487
487    in ../../code/back_end/src/adapt_layer_openssl.c
Value returned is $2 = 0
(gdb) x/10xb 0x7fffffffd380
0x7fffffffd380:    0x00    0x00    0x00    0x00    0x06    0x09    0x2a    0x86
0x7fffffffd388:    0x48    0x86

(gdb) finish
Run till exit from #0  cms_to_buf (cms=0x445290, bio_in=0x444800, 
    data_buffer=0x7fffffffcbe0 "P\320\377\377\377\177", data_buffer_size=0x7fffffffcbcc, flags=17090)
    at ../../code/back_end/src/adapt_layer_openssl.c:365
0x00000000004109f6 in gen_sig_data_cms (in_file=0x416553 "imgsig.bin", 
    cert_file=0x43dc50 "IMG1_1_sha256_2048_65537_v3_usr_crt.pem", 
    key_file=0x43d640 "IMG1_1_sha256_2048_65537_v3_usr_key.pem", hash_alg=SHA_256, 
    sig_buf=0x7fffffffcbe0 "0\202\001\371\006\t*"..., 
    sig_buf_bytes=0x7fffffffcbcc) at ../../code/back_end/src/adapt_layer_openssl.c:487
487    in ../../code/back_end/src/adapt_layer_openssl.c
Value returned is $5 = 0
(gdb) x/10xb 0x7fffffffcbe0
0x7fffffffcbe0:    0x30    0x82    0x01    0xf9    0x06    0x09    0x2a    0x86
0x7fffffffcbe8:    0x48    0x86

Trying it with splint and valgrind

Having failed to find the data corruption by manually inspecting the source, I tried enabling every compiler warning I could, which gave me no results, and then decided to use external tools to help me with that.

First comes splint, a C linter. It produced about a thousand warnings and errors which, while probably at least partially valid, completely overwhelmed me.

Next came valgrind, which, even at the most aggressive settings only pointed me at uninitialised padding data in some structures:

number:  NUMBER
        {
            $$ = malloc(sizeof(number_t));
            $$->next = NULL;
            $$->num_value = $1;
        }
        ;

And, of course, OpenSSL was reading uninitialised data to improve the randomness.

Not very helpful.

Almost giving up

Since I had spent most of the day debugging, valgrinding, reading code and worrying about my soon-to-be-missed connection to Brussels in Frankfurt, I became very frustrated about having no useful results and was close to giving up and calling it a day.

As a last chance attempt, I decided to try all hardening options I could find in the GCC manual page and on the Debian wiki.

First, DEB_BUILD_MAINT_OPTIONS = hardening=+all, -fPIE -pie. Not much changed. Or rather nothing at all.

Then, I found sanitisers. Okay, dump it all into the makefile: -fsanitize=address -fsanitize=undefined -fsanitize=leak -fno-omit-frame-pointer. Compile, link, run: nothing works. Apparently, they need link options as well: -static-libasan -static-libubsan -static-liblsan

Colourful address sanitiser output

Cracked it!

/**
 * sig_size as input to gen_sig_data shows the size of buffer for
 * signature data and gen_sig_data returns actual size of signature
 * data in this argument
 */
uint32_t sig_size = SIGNATURE_BUFFER_SIZE;

...

/**
 * Calling gen_sig_data to generate signature for data in file using
 * certificate in cert_file. The signature data will be returned in
 * sig and size of signature data in sig_size
 */
ret_val = gen_sig_data(file, cert_file, hash, sig_fmt,
    sig, (size_t *)&sig_size, g_mode);

Of course, size_t is 32-bit on 32-bit systems, but significantly wider on 64-bit ones, so no surprise this code would overwrite something else on the stack, in this case the signature data.
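One straightforward way to avoid the overwrite, sketched below, is to give gen_sig_data a genuinely size_t-sized variable and narrow the result explicitly afterwards, rather than casting a pointer to a narrower type; this is an illustration, not the actual patch that was submitted:

/* Let gen_sig_data write into a real size_t, then narrow the result
 * explicitly, instead of letting it scribble 8 bytes over a 4-byte variable. */
size_t sig_size_full = SIGNATURE_BUFFER_SIZE;

ret_val = gen_sig_data(file, cert_file, hash, sig_fmt,
    sig, &sig_size_full, g_mode);

sig_size = (uint32_t)sig_size_full;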

One of the lessons I learnt from this: never underestimate the compiler’s built-in tools. At least, not the sanitisers!

Oh, just in case you want to see more: the patches currently live at the Apertis GitLab.

2020-02-18 Update: This post has also been published at Collabora’s website: https://www.collabora.com/news-and-blog/blog/2020/02/18/using-gcc-sanitisers-to-get-a-nasty-bug-fixed

by Andrej Shadura at January 31, 2020 11:35 AM

January 30, 2020

Sebastian Reichel

Collabora & Linux Kernel 5.5

Linus tagged the 5.5 kernel release, so let’s have a look at Collabora’s participation. Adrian, Martyn and Gustavo Padovan worked on the Broadcom brcmfmac WiFi driver fixing suspend support for some platforms which cut power during sleep to save battery and removing an incorrect warning message. These were done as part of a broader effort to improve upstream support of peripherals used together with the i.MX 6 family of processors.

January 30, 2020 02:42 PM

January 28, 2020

Andrew Shadura

FOSDEM by train

I’ve always loved train journeys, but with flygskam changing people’s travel preferences across Europe (and possibly worldwide, though probably not that much), I decided to take the train to FOSDEM this time.

When I first went to FOSDEM (which, just in case you don’t know, happens each February in Brussels at ULB), I flew with Ryanair from Bratislava to Charleroi because it was cheaper. After repeating the same journey a couple of times, and once nearly missing the last coach to Brussels because of a late flight, I decided to rather pay more but travel in more comfort via Brussels Zaventem, the main airport of Brussels. It is well connected with Brussels: trains run fast and run often, which is a significant upgrade in comparison to Charleroi, where the options were limited to coaches and a slow train connection from the town of Charleroi.

As some of my readers may know, my backpack was stolen from me after FOSDEM two years ago, and with it were gone, among other things, my passport and my residence permit card. With my flight home planned just two and a half hours from the moment I realised my things were gone, I couldn’t get a replacement travel document from the embassy quickly enough, so I had to stay at my friend’s in Vilvoorde (thanks a lot again, Jurgen!) and travel with the cheapest ground transportation I could find. In my case, it was a night RegioJet coach to Prague with a connection to (again) a RegioJet train to Bratislava. (I couldn’t fly even though I already had my temporary travel document, since I might have needed to somehow prove that I was allowed to be in the Schengen zone, which is difficult to do without a valid residence permit.) Sleeping on a bus isn’t the best way to travel long distances, and I was knackered when I finally dropped onto my sofa in Bratislava the next morning. However, what I learnt was that it was possible, and were it a bit more comfortable, I wouldn’t mind something like this again.

Here I must admit that I’ve travelled by long-distance trains quite a fair bit: in my childhood we went by train for summer holidays to Eupatoria in Crimea and Obzor in Bulgaria (through Varna). Both journeys took days, and the latter also involved a long process of changing the bogies on the Moldovan-Romanian border (Giurgiulești/Galați). Since I moved to Slovakia, I have many times taken the night train from Minsk to Warsaw with a connection to Bratislava or Žilina, a journey which usually takes at least 18 hours. Which is to say, I’m not exactly new to this mode of travel.

With the Austrian railway company ÖBB expanding their night train services as part of their Nightjet brand, the Vienna to Brussels sleeper returned to service last week. Prices are still a bit higher than I would have preferred (at the time of writing, tickets in a compartment with 6 couchettes start at €79), but it’s not as bad as it could be (apparently the last-minute price is more than €200). Anyway, when I decided to go to Brussels by train, this service didn’t exist yet, so instead I followed the very useful tips from the Man in Seat 61 and booked a day-time connection: Bratislava to Vienna to Frankfurt to Brussels.

Train route from Bratislava to Vienna to Frankfurt to Brussels

Date  Station                         Arrival  Departure  Train
30.1  Bratislava hl.st.                        9:38       REX 2511
      Wien Hbf                        10:44    11:15      ICE 26
      Frankfurt am Main Hbf           17:36    18:29      ICE 10
      Bruxelles-Nord / Brussel Noord  21:26    21:37      IC 3321
      Vilvoorde                       21:45

Date  Station                         Arrival  Departure  Train
4.2   Bruxelles-Midi                           8:23       ICE 13
      Frankfurt am Main Hbf           11:31    12:22      ICE 27
      Wien Hbf                        18:45    19:16      R 2530
      Bratislava hl.st.               20:22

I booked two through tickets, since in this case the Super Sparschiene discount offered lower prices than an international return ticket normally would. For some reason neither ZSSK (Slovak railways) nor ÖBB offered this ticket online (or for a comparable price in a ticket office anyway), so I booked online with Deutsche Bahn for €59.90 each way. This sort of ticket, while bookable online, had to be posted for an extra €5.90.

Since I’m staying the first night at a friend’s in Vilvoorde again, I also had to buy a ticket for this small stretch of track directly from the Belgian railways.

See you at FOSDEM!

Building AW
Inside the K building
Grand Place

by Andrej Shadura at January 28, 2020 07:18 PM

December 16, 2019

Alexandros Frantzis

Efficient saving of multi-source buffers

Preface

Several years ago I was working on libbls, a library implementing editable buffers that can efficiently hold contents originating from both memory and file sources. Typical clients of this library are programs that need to edit arbitrary portions of very large files, like hex editors or wave editors, while minimizing the required memory usage.

One of the more interesting problems I encountered was devising an efficient way to save buffers to their corresponding files in place, while minimizing the extra memory (or disk space) used in the process. This post describes how this problem, which turns out to be NP-hard, can be modeled as a graph, and how we can provide a reasonable solution by using a variation of a standard graph algorithm.

Introduction

Buffers in libbls are described in terms of segments that contain data from either memory or file sources. In order to conserve memory (and sometimes to even make the editing feasible) the data from file sources is not loaded into main memory. Instead, the buffer keeps information about the contained files and ranges and reads the data in vm-page-sized chunks when needed. So, a buffer could look like:

B:  |F1:0-9   |M1:10-18|F1:10-19 |M2:10-19|F2:0-6|

Where F1 and F2 are file sources, M1 and M2 are memory sources and S:B-E denotes a segment representing bytes from the range [B, E] of source S.

In the simple case, saving a buffer to a file consists of just reading data from the various sources and writing them out to the target file. Things become more complicated when the target file happens to be one of the file sources used in the buffer.

This post aims to illustrate the issues that can arise in such cases and propose an elegant way to resolve them.

Illustrating the problem

Saving the buffer to a file typically involves reading each segment in order and writing it out to the file. However, if the segment source is the same as the file that we are writing to, and depending on the ordering of segments, we may face a serious problem.

To illustrate the problem, consider a buffer B containing segment [0-9] of file F at range [10-19] of the buffer, and assume that we want to write the buffer back to file F:

    S
F:  |0-9      |.........|

              S             
B:  |.........|F:0-9    |
    ^         ^         ^
    0         10        20

Writing the buffer segments in order leads to a situation where we first write out buffer range [0-9] to file F, overwriting the contents of the file at that range. So, when moving on to the next segment, trying to write out the range [10-19] of the buffer, the data we are going to read from the file (and therefore write) is not the original data of [F:0-9], but the data belonging to the previous segment of the buffer we just wrote at positions [0-9].

The solution in this case is to first write out range [10-19] of the buffer to the file, so that the data we use is the original file data, and only then write out any other segments.

The problem can become even more complicated when we have more segments from the target file. A particularly interesting case arises when a segment from the buffer which belongs to the target file happens to overlap with another segment of F which is also present in the buffer:

    S                   T
F:  |0-9      |.........|20-29    |.........|
                      
            T                     S 
B:  |.......|F:20-29  |...........|F:0-9    |
    ^       ^         ^           ^
    0       8         18          30

In this case segment S in F overlaps with segment T in B. We can simply try to adapt the strategy used in the previous case and first write out the two target file segments. This works but only if we are extra careful. In this case, if we first write segment T then when we try to read the data of segment S we will read wrong data (file range 8-9 will contain data from segment T). If we do it the other way around everything works wonderfully.

Taking the last example one step further, consider what happens if we have cyclic overlaps:

    S                   T
F:  |0-9      |.........|20-29    |....|

            T                S 
B:  |.......|F:20-29  |......|F:0-9    |
    ^       ^         ^      ^     ^
    0       8         18     25    30

Now segment S in F overlaps with segment T in B, and segment T in F overlaps with segment S in B. In this case there is no way to order the writes of segments S and T so that we end up with the correct data.

A straightforward but wasteful way of tackling this problem is to save the whole buffer to a temporary file and then move the file to its final location. This works, and in some cases is preferred since it has the benefit of atomicity, but requires extra space for the temporary file. When buffers reach many GiB in size this method may prove to be unfeasible.

A more efficient method is to try to eliminate the cyclic overlaps by removing at least one of the overlaps involved. In the previous case we can replace segment T in B with two segments M and T1 so that no overlap occurs. Segment M will contain a part of T's data but will actually store it in memory (or a temporary file if necessary), and the T1 segment will just be T minus the range stored in M. This approach still requires some amount of extra memory, but in most cases this amount is much smaller than the size of the whole buffer. Using this scheme we have:

    S                     T1
F:  |0-9      |...........|22-29  |....|

            M T1              S 
B:  |.......|.|F:22-29|......|F:0-9    |
    ^       ^         ^      ^     ^
    0       8         18     25    30

We have managed to eliminate the first overlap and now we can safely save the buffer by first writing T1 then S and then the remaining segments of B.

Modeling the problem

The examples presented in the previous section show representative but simple cases of the problems we can run into. In order to be able to handle cases of arbitrary complexity it is useful to create a model of the problem. We can then use this model to come up with algorithms that provide generic solutions.

The model we are going to present uses graphs. In these overlap graphs vertices represent segments from the file we are going to save to, that are also present in the buffer we are going to save. Edges between vertices denote segment overlap: an edge from vertex P to vertex Q denotes that segment P in the target file overlaps with segment Q in the buffer. Every edge carries with it the size of the overlapping range.
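As a rough sketch of the data this model needs to carry (the names below are illustrative, not libbls's actual types; off_t comes from <sys/types.h>):

/* One vertex per file segment that also appears in the buffer; each outgoing
 * edge records which other segment's buffer range it overlaps and by how
 * many bytes. */
struct overlap_edge {
        struct overlap_vertex *target;   /* the buffer segment being overlapped */
        off_t                  weight;   /* size of the overlapping range */
        struct overlap_edge   *next;
};

struct overlap_vertex {
        struct segment        *segment;    /* segment of the target file */
        struct overlap_edge   *out_edges;  /* overlaps this segment's file data suffers */
        unsigned               in_degree;  /* incoming overlaps, useful for ordering */
};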

Here are some buffers and their respective overlap graphs (P* denotes a self-loop on P):

    P              Q                   R
F:  |0-9      |....|15-24    |.........|35-42  |

                             P
B1: |........................|F:0-9    |.......|   P

              P              Q
B2: |.........|F:0-9     |...|F:15-24  |.......|   Q --> P

         P                Q
B3: |....|F:0-9    |......|F:20-29  |..........|   P* Q*

    R        P                    Q
B4: |F:35-42|F:0-9    |...........|F:20-29  |..|   R <-- P* <-- Q
                                                    \___________^

Solving the problem

Using the overlap graph we can now answer the question: in what order should the vertices (segments) be written to the file so that the correct data is written?

If the graph has no cycles then we can process the vertices in topological order. Topological order guarantees that each vertex is processed only when its dependencies have already been processed and written to file, and thus that no unwritten file segment in the buffer overlaps with the destination range of that vertex.
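A minimal sketch of that ordering step, using Kahn's algorithm over a small adjacency matrix (the real code in overlap_graph.c is organised differently):

/* Emit vertices in topological order: a vertex is ready once all of its
 * incoming edges (ignoring self-loops) come from already-written vertices,
 * i.e. writing it can no longer clobber file data another segment still
 * needs. adj[i][j] != 0 means there is an edge i -> j. Assumes the graph
 * is already acyclic. */
#define MAX_VERTS 16

static int topo_order(int n, int adj[MAX_VERTS][MAX_VERTS], int order[MAX_VERTS])
{
        int in_deg[MAX_VERTS] = { 0 };
        int done[MAX_VERTS] = { 0 };

        for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                        if (adj[i][j] && i != j)
                                in_deg[j]++;

        for (int k = 0; k < n; k++) {
                int v = -1;
                for (int i = 0; i < n; i++)
                        if (!done[i] && in_deg[i] == 0) {
                                v = i;
                                break;
                        }
                if (v < 0)
                        return -1;      /* a directed cycle remains */
                order[k] = v;
                done[v] = 1;
                for (int j = 0; j < n; j++)
                        if (adj[v][j] && v != j)
                                in_deg[j]--;
        }
        return 0;
}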

As a special case, we can process a vertex that has a self-loop but no other incoming edge. This is achieved by first writing to the file the non-overlapping part of the vertex and then the rest.

In order to handle graphs with cycles (except self-loops) we must find a way to break the cycles. This can be achieved by removing the correct edges. An edge from P to Q can be removed by replacing Q in the buffer by the segments Q1 and M as described previously. M contains an in memory copy of data of the overlapping range. Q1 is the part of Q that remains. This technique eliminates the overlap (because the overlapping part is now stored in memory) and removes the edge. By choosing the edges to remove wisely, we can minimize the amount of extra memory we are going to need.

Once we have an acyclic graph (self-loops allowed) we can process it as described previously.

Let's manually apply this method to an example:

    P              Q                   R
F:  |0-9      |....|15-24    |.........|35-42  |

    R        P                    Q
B4: |F:35-42|F:0-9    |...........|F:20-29  |..|   R <-- P* <-- Q
                                                    \___________^

We first have to make the graph acyclic. We have a few choices for which edge to break:

  • Q->P, a 3 byte overlap
  • P->R, a 7 byte overlap
  • R->Q, a 5 byte overlap

To minimize the required extra memory we break the Q->P edge by storing the last three bytes of P in memory as segment M, so we get:

    P1             Q                   R
F:  |0-6   |.......|15-24    |.........|35-42  |

    R       P1     M              Q
B4: |F:35-42|F:0-6 |..|...........|F:20-29  |..|   Q <-- R <-- P1

Note that coincidentally we also resolved the self-loop of P. Now that we have an acyclic graph we can process the buffer segments in topological order, thus we first store P1 in range [8-14] of the file, next is R at [0-7] and finally Q at range [30-40] followed by all other portions of the buffer.

Implementation Details

libbls implements the solution described above when saving a file in place. The high level logic for the algorithm resides in bless_buffer_save. The implementation of the algorithm within that function can be conceptually split into three steps. Some implementation details for each step are given below, along with links to the functions that provide the relevant functionality.

Step 1: Creating the overlap graph

Relevant functions:

buffer_file.c: create_overlap_graph
overlap_graph.c: overlap_graph_add_segment

To create the overlap graph we have to scan the buffer segment collection for segments that belong to the file we want to save to. Each new segment that is found must be checked against all previously found file segments for overlap. This is an O(n + k*f(k)) algorithm where n is the number of segments in the segment collection and k the number of segments belonging to the file. If we check for overlap naively then f(k) = O(k) and therefore the algorithm is O(n + k*k). We can use an interval tree so that f(k) = O(logk) and the algorithm becomes O(n + k*logk). As we don't expect k to grow too large, the implementation currently uses the naive approach for simplicity.
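The overlap test itself is a one-line range intersection; here is a sketch of the naive version (illustrative, not the actual create_overlap_graph code; off_t comes from <sys/types.h>):

/* Two ranges [start, start + size) overlap iff each one starts before the
 * other ends; the naive scan simply applies this to every pair of file
 * segments found in the buffer. */
static int ranges_overlap(off_t start1, off_t size1, off_t start2, off_t size2)
{
        return start1 < start2 + size2 && start2 < start1 + size1;
}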

Step 2: Making the graph acyclic

Relevant functions:

overlap_graph.c: overlap_graph_remove_cycles

In the second step the goal is to remove cycles from the graph, while trying to minimize the total weight of the removed edges in order to minimize extra memory usage. However, it turns out that the problem we are trying to solve, minimum feedback arc set, is NP-hard! Don't despair though; we can get reasonable results with only a little extra work.

The undirected counterpart of the minimum feedback arc set problem is the minimum (or maximum) spanning tree (MST) problem, which can be solved efficiently using various methods. We can use these methods to give a solution (although not the best one) for our problem, too. We can do that because MST algorithms remove all undirected cycles and therefore all directed ones, too. The caveat is that they remove more than we need them to: undirected cycles that are not directed ones.

MST algorithms work by processing edges in non-decreasing or non-increasing order of weight and adding them to the already formed tree (Prim) or trees (Kruskal) if they aren't already connected to them by an undirected path. Because our goal is to avoid directed cycles we can be more lax with the rules of adding an edge. A simple but useful observation is that if the destination node of an edge has no outgoing edges then a directed cycle cannot be formed, regardless of whether there is a pre-existing undirected path between the two nodes. Similarly, if the source node of an edge has no incoming edges a directed cycle cannot be formed either, regardless of the pre-existence of an undirected path between the source and destination nodes.

We can use the previous observations to decrease the number of edges that are incorrectly deleted by the MST algorithms when acting on directed graphs. Although we don't completely eliminate the erroneous deletions, these rules give reasonable results while keeping the complexity the same as in the undirected case.

Of the MST algorithms, Kruskal's algorithm can be used unchanged in directed graphs because it works directly on edges and doesn't care about their direction. It also works on disconnected graphs, since it finds a spanning forest. For these reasons, it was selected as the basis for the algorithm used by libbls.

Step 3: Removing edges and saving to file

Relevant functions:

overlap_graph.c: overlap_graph_get_removed_edges
buffer_file.c: break_edge
buffer_util.c: segcol_store_in_memory, segcol_store_in_file
overlap_graph.c: overlap_graph_get_vertices_topo

In the previous step we calculated which edges would need to be removed to create an acyclic graph. In this step we first perform the actual removal of these edges by storing the associated data in memory or a temporary file as described previously.

Removing the edges has the unfortunate side effect of further altering the graph as segments are removed, added or changed in the segment collection. This means that after removing the edges we must re-calculate the overlap graph for the segment collection. There may be some way to avoid re-calculating the whole graph, making only local changes to the existing graph as vertices are altered, but we will use the simple (although more costly) way for now.

We then use this new overlap graph, which is guaranteed not to have any cycles, to get the vertices in topological order and save them to the target file.
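
The topological ordering itself is the textbook part; a minimal Kahn-style sketch in Python (purely illustrative, the real work happens in overlap_graph_get_vertices_topo) looks like this:

from collections import deque

def topological_order(vertices, edges):
    # edges is a list of (src, dst) pairs; the graph is assumed to be acyclic,
    # which step 2 has already guaranteed.
    successors = {v: [] for v in vertices}
    in_degree = {v: 0 for v in vertices}
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    ready = deque(v for v in vertices if in_degree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for succ in successors[v]:
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                ready.append(succ)
    return order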

by Alexandros Frantzis at December 16, 2019 12:00 AM

October 28, 2019

Guillaume Tucker

KernelCI and the Linux Foundation

A KernelCI developer's view on the project joining the Linux Foundation

KernelCI

As announced today at the Open Source Summit in Lyon (see the official press release), the KernelCI project which powers kernelci.org with automated testing for the upstream Linux kernel has now joined the Linux Foundation. This is the new home the project has finally found after sailing through uncharted waters for over five years.

What does it actually mean for the project?

Given the huge scale at which the Linux kernel is being used, achieving comprehensive test coverage for it is an incredibly challenging goal. Based on open source philosophy principles, KernelCI's distributed architecture makes this possible by enabling the whole kernel community to collaborate around a single upstream CI system. Becoming part of the Linux Foundation will let the project flourish and, in turn, become an integral part of the Linux kernel development workflow.

Some actual results of this move can already be seen with the new common database for storing test results from KernelCI and other related projects that share a common goal such as Red Hat's CKI and Linaro's LKFT. It is an experimental step towards expanding KernelCI in a modular way to interact with other existing systems. There are as many different ways to test the kernel as there are use-cases for it, and many types of specialised systems to cover: a CPU architecture such as 0-day, a Linux distribution such as CKI, a range of reference platforms such as LKFT...

What happens next?

This is only a new beginning, many things are yet to come with many decisions to be made: how to use the budget available from the membership scheme, how to make the infrastructure more sustainable and scalable, how to compromise and address the needs of all the members joining the project... Answers to all these questions are likely to appear as the coming months and years unfold.

One thing we can be sure of is that there is no reason for the current development plans to stop or be impacted in any way - on the contrary. There is indeed a strong need to extend test coverage and the capabilities of KernelCI at large, with a huge potential to improve upstream kernel development as a direct result. Becoming part of the Linux Foundation should be about facilitating progress in this direction above all.

What does Collabora plan to do?

Collabora has been involved with KernelCI since almost the beginning, providing some of the servers running the kernelci.org services and a large LAVA test lab. We've also become the biggest contributor to KernelCI development with myself as de facto maintainer for the core repositories on Github and in charge of the weekly kernelci.org updates.

Among all the things we have done in the last couple of years, these are probably the most significant ones:

  • Enabled automated bisection for boot regressions

  • Added v4l2-compliance test suite and a few more minor ones (suspend/resume...)

  • Enabled "Depthcharge" bootloader in LAVA to run KernelCI tests on Chromebooks

  • Portable command-line tools and YAML-based configuration to perform all the KernelCI steps in any environment

Going forward, here are some clear items on the project's current roadmap which will carry on full steam ahead with a growing team of developers:

  • Further improve handling of test results: web interface, email reports, regression tracking

  • Extend coverage of test suites: kselftest, LTP, KUnit, i-gpu-tools and many more

  • Deploy new bisection tool to cover complex test suite results

  • Improve modular core architecture to enable inter-operability between the various elements that form the overall KernelCI pipeline

Then, as with every open source project, other contributors will also add more features, enable new platforms in the global test hardware pool or bring new ideas. The combined efforts of all parties are what make the project unique and mature enough to reach its ultimate goal. We will make sure we continue to be among the top active players within the growing KernelCI community, as a founding member of the Linux Foundation project as well as with myself being both a maintainer and on the technical steering committee.

by Guillaume Tucker (guillaume.tucker@gmail.com) at October 28, 2019 12:00 PM

October 24, 2019

Erik Faye-Lund

Zink: Fall Update

I recently went to XDC 2019, where I gave yet another talk about Zink. I kinda forgot to write a blog-post about it, so here’s me trying to make up for it… or something like that. I’ll also go into some more recent developments as well.

My presentation was somewhat similar to the talk I did at SIGGRAPH this year, but with a bit more emphasis on the technical aspect, as the XDC audience is more familiar with Mesa internals.

If you’re interested, you can find the slides for the talk here. The talk goes through the motivation and basic approach. I don’t think I need to go through this again, as I’ve already covered that before.

As for the status, Zink currently supports OpenGL 2.1 and OpenGL ES 2.0. And there are no immediate plans to work on OpenGL 3.0 and OpenGL ES 3.0 until Zink is upstream.

Which gets us to the more interesting bit; that I started working on upstreaming Zink. So let’s talk about that for a bit.

Upstreaming Zink

So, my current goal is to get Zink upstream in Mesa. The plan outlined in my XDC talk is slightly outdated by now, so here I’ll instead say what’s actually happened so far, and what I hope will follow.

Before I could add the driver itself, I decided to send all changes outside of the driver as a set of merge-requests:

The last one is probably the most interesting one, as it moves a lot of fixed-function operations into the state-tracker, so individual drivers won’t have to deal with them. Unless they choose to, of course.

But all of these have already been merged, and there’s just one final merge-request left:

This merge-request adds the driver in its current state. It consists of 163 commits at the time of writing, so it’s not a thing of beauty. But new drivers usually aren’t, so I’m not too worried.

When this is merged, Zink will finally be a “normal” part of Mesa… Well, sort of anyway. I don’t think we’ll enable Zink to be built by default for a while. But that’ll just be a matter of adding zink to the -Dgallium-drivers meson-option.

Testing on CI

The current branch only adds building of Zink to the CI. There’s no testing being done yet. The reasons for this are two-fold:

  1. We need to get a running environment on CI. Rather than bringing up some hardware-enabled test-runner, I intend to try to set up SwiftShader as a software rasterizer instead, as that supports Vulkan 1.1 these days.
  2. We need some tests to run. Zink currently only supports OpenGL 2.1, and sadly the conformance test suite doesn’t have any tests for OpenGL versions older than 3.0. Piglit has some, but a full piglit-run takes significantly more time, which makes it tricky for CI. So right now, it seems the OpenGL ES 2.0 conformance tests are our best bet. We’ll of course add more test-suites as we add more features.

So, there’s some work to be done here, but it seems like we should be able to get something working without too much hassle.

Next steps

After Zink is upstream, it will be maintained similarly to other Mesa drivers. Practically speaking, this means that patches are sent to the upstream repo rather than my fork. It shouldn’t make a huge difference for most users.

The good thing is that if I go away on vacation, or am for some other reason unavailable, other people can still merge patches, so we’d slightly reduce the technical bus-factor.

I’m not stopping developing Zink at all, but I have other things going on in my life that mean I might be busy with other things at times. As is the case for everyone! :wink:

In fact, I’m very excited to start working on OpenGL 3.x and 4.x level features; we still have a few patches for some 3.x features in some older branches.

The future is bright! :sunny:

by Erik Faye-Lund at October 24, 2019 02:22 PM

October 07, 2019

Alexandros Frantzis

mda-rs: Custom Mail Delivery Agents in Rust

I had been stubbornly using procmail for two decades to filter my email, before deciding that it was finally time for a change. Procmail served me well through the years, but, little by little, the annoyances started adding up. The fact that procmail has been unmaintained since 2001 inspired less and less confidence as years passed, and the requirement in many cases for external processing before any useful matching could be performed (e.g., dealing with MIME content transfer encoded data, or non us-ascii character sets) was a constant pain point. As the complexity of my rules grew, I even had to resort to an external program (a custom C++ program in my case) to perform some of the mailbox selection logic, using the AUTOFOLDER feature of procmail.

At that point, and given all the functionality that I had to implement myself, I seriously started questioning the value procmail was providing to me and started looking for alternatives. I evaluated fdm and maildrop, finding in both a lot that I liked; first and foremost a less arcane filtering language. In the end, I found maildrop to be a closer match to my preferences and requirements, and I especially appreciated the automatic MIME content decoding.

I briefly considered switching to maildrop, but a few experiments on my set of filtering rules indicated that maildrop's performance was significantly worse compared to procmail, even though for procmail I had to call out to a few more external programs to achieve similar functionality. I also gave fdm a go, and it was even slower than maildrop. I receive a lot of emails each day, mostly from various FOSS mailing lists, and the performance penalty adds up. While waiting for a few dozen seconds daily wouldn't have been the end of the world, the thought that my original and much older solution was faster, disturbed me.

Meanwhile, I noticed that the control flow statements in maildrop's filtering language were similar to that of a general purpose programming language, and, in fact, with the integration with regular expressions, they looked a bit like perl. So, I thought, what if instead of using a domain specific language, I could use a general purpose language, augmented by some kind of library to implement the same filtering and delivery functionality. My thoughts then drifted to other things, but the seed of that idea had already been planted in my mind it seems, as a few years later I found myself implementing the code that would become mda-rs.

With mda-rs I wanted to create an experience as close as possible to using an interpreted domain specific language, the approach followed by typical MDAs, while still having the performance and power of a full, compiled language at one's fingertips. One aspect of this experience was providing an API that feels like a natural fit for the intended purpose. The other aspect was providing a straightforward way to build a custom MDA. For this second aspect, the simplicity of Rust's cargo was one of the reasons I decided to use Rust for this project.

Another decision catering to a similar ease-of-use goal was that the user shouldn't be required to use external programs to transform the data just so they could perform basic matching. To this end, mda-rs, like maildrop, normalizes the email before processing, by decoding and converting all text MIME content parts (including MIME encoded-words in headers) to plain UTF-8.

Of course, I also wanted the resulting custom MDAs to be fast; performance was my main disappointment with other MDAs after all. Performance requires an efficient implementation, but also an API that encourages performant use. An example of the effect of such a concern on the mda-rs API is that the API provides access to the header fields by name, so that one can perform targeted per-header-field string searches, which can be much faster than regular expression searches of the whole header.

Finally, an important requirement for all MDAs is that the email delivery is durable, i.e., that no email will be lost in case of a system failure. In other words, at any point in time at least one of the local filesystem and the email server should have a complete and fully accessible copy of the email.

While investigating the best way to provide such durability guarantees, I noticed that all three MDAs mentioned previously are fsyncing the written email file, as expected. However, they are not fsyncing the containing directory, which could potentially lead to the file not appearing on the filesystem [1]. The exact guarantees in this scenario seem to be dependent on the filesystem, but, to provide maximum durability, mda-rs fsyncs both the file and the containing directory by default. This does incur a performance penalty, so I have also included an option to use the file-sync-only approach if preferred.
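
Since mda-rs itself is written in Rust, the snippet below is only a language-neutral illustration (in Python, with invented names) of what the full sync approach means in practice: fsync the delivered file and then its parent directory, so that both the data and the new directory entry reach stable storage:

import os

def durable_write(path, data):
    # Write and flush the file contents first ('data' is a bytes object).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

    # Then flush the containing directory, so the new entry is durable too.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

A real maildir delivery additionally writes to tmp/ and renames the file into new/, which this sketch leaves out.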

Since mda-rs was written to cater primarily to my personal needs, it currently has some limitations, or unimplemented functionality, if you will. The most prominent one is that it delivers to maildir only, since that's the only mailbox format I use. The second is that there is no built-in way to change the email data (e.g., add a header field) except by filtering through an external tool, since this is another feature I don't use.

Here is a small taste of what a custom MDA would look like with mda-rs:

use std::path::PathBuf;

use mda::{Email, EmailRegex, Result, DeliveryDurability};

fn main() -> Result<()> {
    let root = PathBuf::from("/tmp/my-personal-mail");

    let mut email = Email::from_stdin_filtered(&["/usr/bin/bogofilter", "-ep"])?;
    // Use quicker (but possibly less durable) delivery.
    email.set_delivery_durability(DeliveryDurability::FileSyncOnly);

    let from = email.header_field("From").unwrap_or("");
    let bogosity = email.header_field("X-Bogosity").unwrap_or("");

    if bogosity.contains("Spam, tests=bogofilter") ||
       from.contains("@banneddomain.com") {
        email.deliver_to_maildir(root.join("spam"))?;
        return Ok(());
    }

    let cc = email.header_field("Cc").unwrap_or("");
    let to = email.header_field("To").unwrap_or("");

    if to.contains("myworkemail@example.com") ||
       cc.contains("myworkemail@example.com") {
        if email.body().search("URGENCY RATING: (CRITICAL|URGENT)")? {
            email.deliver_to_maildir(root.join("inbox/myemail/urgent"))?;
        } else {
            email.deliver_to_maildir(root.join("inbox/myemail/normal"))?;
        }
        return Ok(());
    }

    email.deliver_to_maildir(root.join("inbox/unsorted"))?;

    Ok(())
}

and a corresponding minimal Cargo.toml:

[package]
name = "my-mda"
version = "0.1.0"
edition = "2018"

[dependencies]
mda = "0.1"

To provide an idea of the performance gains to expect, I benchmarked a us-ascii only version of my personal mail filtering rules on a sample of 250 of my recently received emails using all the aforementioned MDAs. Of course, the performance results will vary depending on both the rules and the emails themselves, but in my experience what is presented below seems to be a typical outcome. I'd be very interested to know how mda-rs worked for others on the performance front.

In the benchmark I included runs both with and without filtering for spam, since such filtering takes up a significant amount of processing and affects the relative results. For mda-rs I included two versions, one that mostly uses per-header-field searches, and a second one that uses regular expressions exclusively, and thus is a bit closer to how the other MDAs work. For each mda-rs version I ran the benchmark both in file sync only mode, which is how the other MDAs work, and full sync (file and directory) mode, which is the default for mda-rs. Note that, for this benchmark, both maildrop and mda-rs performed MIME decoding and translation to UTF-8 (internally and automatically), whereas neither procmail nor fdm did so, and no external program was used to provide such functionality. I verified that the delivery results were the same for all MDAs.

[Figure: mda benchmark results]

mda-rs wins hands down when operating in file sync mode, while at the same time doing more work (normalizing the email) compared to the next fastest contestant, procmail. In full sync mode, the extra directory syncs prove to be expensive enough to overshadow other mda-rs performance gains, and procmail takes the lead. Still, even in this case, the performance is acceptable and much better compared to both maildrop and fdm.

If you would like to learn more about how to use mda-rs, you can find detailed documentation at: https://docs.rs/mda

The mda-rs project repository is at: https://github.com/afrantzis/mda-rs

The mda-rs crates.io entry is at: https://crates.io/crates/mda

Enjoy!

[1] See https://lwn.net/Articles/457667/

by Alexandros Frantzis at October 07, 2019 12:00 AM

September 25, 2019

Andrew Shadura

Rust-like enums in Kotlin

Rust has an exciting concept of enumeration types, which is much more powerful than enums in other languages. Notably C has the weakest type of enum, since there’s no type checking of any kind, and enum values can be used interchangeably with integers:

enum JobState {
    PENDING,
    STARTED,
    FAILED,
    COMPLETED
};

You can opt for manually assigning integers instead of leaving this to the compiler, but that’s about it.

Higher level languages like Python and Java treat enumeration types as classes, bringing stricter type checking and better flexibility, since they can be extended nearly like any other class. In both Python and Java individual enumerated values are singleton instances of the enumeration class.

from enum import Enum, auto

class JobState(Enum):
    PENDING = auto()
    STARTED = auto()
    FAILED = auto()
    COMPLETED = auto()

enum JobState {
    PENDING,
    STARTED,
    FAILED,
    COMPLETED;
}

Since enumerations are classes, they can define extra methods, but because the enum values are singletons, they can’t be coupled with any extra data, and no new instances of the enum class can be created.

In contrast with Python and Java, Rust allows attaching data to enumerations:

enum JobState {
    Pending,
    Started,
    Failed(String),
    Completed
}

This allows us to store the error message in the same value as the job state, without having to declare a structure with an extra field which would be used only when the state is Failed.

So, what does Kotlin have to offer? Kotlin has a language feature called sealed classes. A sealed class is an abstract class with limited inheritance: all of its subclasses have to be declared in the same file. In a way, this is quite close to Rust enums, even though sealed classes look and behave a bit differently.

sealed class JobState {
    object Pending : JobState()
    object Started : JobState()
    object Completed : JobState()
    data class Failed(val errorMessage: String) : JobState()
}

Declared this way, JobState can be used in a way similar to Rust’s enums: a single variable of this type can be assigned singletons Pending, Started or Completed, or any instance of Failed with a mandatory String member:

val state: JobState = JobState.Failed("I/O error")

when (state) {
    is JobState.Completed ->
        println("Job completed")
    is JobState.Failed ->
        println("Job failed with an error: ${state.errorMessage}")
}

This usage resembles the regular Java/Kotlin enums quite a bit, but alternatively, Pending and friends can be declared outside of the sealed class, allowing them to be used directly without the need to add a JobState qualifier.

A slightly simplified real life example from a Kotlin project I’m working on, where a separate coroutine handles I/O with a Bluetooth or a USB device:

sealed class Result
object Connected : Result()
data class Failed(val error: String) : Result()

sealed class CommServiceMsg
data class Connect(val response: CompletableDeferred<Result>) : CommServiceMsg()
object Disconnect : CommServiceMsg()
data class Write(val data: ByteArray) : CommServiceMsg()

fun CoroutineScope.bluetoothServiceActor(device: BluetoothDevice) = actor<CommServiceMsg>(Dispatchers.IO) {
    val socket: BluetoothSocket = device.createSocket()

    process@ for (msg in channel) {
        when (msg) {
            is Connect -> {
                with(socket) {
                    msg.response.complete(try {
                        connect()
                        Connected
                    } catch (e: IOException) {
                        val error = e.message ?: ""
                        Failed(error)
                    })
                }
            }
            is Disconnect -> break@process
            is Write -> {
                socket.outputStream.write(msg.data)
            }
        }
    }
    socket.outputStream.flush()
    socket.close()
}

Here, we can talk to bluetoothServiceActor using messages each carrying extra data; if the coroutine needs to talk back (in this example, the result of a connection attempt), it uses a CompletableDeferred<> value of the Result type, which can hold an error message when needed.

With that in place, we can write something like this:

val bluetoothService = bluetoothServiceActor(device)
val response = CompletableDeferred<Result>()

bluetoothService.send(Connect(response))
var result = response.await()
when (result) {
    is Connected -> {
        bluetoothService.send(Write(byteArrayOf(42, 0x1e, 0x17)))
        bluetoothService.send(Disconnect)
    }
    is Failed ->
        println("error occurred: ${result.error}")
}

by Andrej Shadura at September 25, 2019 01:31 PM

July 25, 2019

Erik Faye-Lund

Zink: Summer Update and SIGGRAPH 2019

Here’s an overview of the recent changes in Zink since the previous post, as well as an exciting announcement!

What’s new in the world of Zink?

OK, so I haven’t been as good at making frequent updates as I was hoping, but let’s try to make up for it:

Since last time, there are quite a lot of things that have been resolved:

  • We now do proper control-flow. This means things like if-statements, for-loops etc. There might be some control-flow primitives missing still, but that’s because I haven’t encountered any use yet.
  • Alpha testing has been implemented.
  • Client-defined clip-planes have been implemented.
  • Support for gl_FrontFacing has been implemented.
  • Lowering of glPointSize() to gl_PointSize has been implemented. This means you can use glPointSize() to specify sizes instead of having to write the gl_PointSize-output from the vertex shader.
  • Support for gl_FragDepth has been implemented.
  • Two-sided lighting has been implemented.
  • Shadow-samplers have been implemented.
  • Support for 8-bit primitive indices has been implemented.
  • Occlusion queries have been implemented correctly across command buffers. This includes the ability to pause / restore queries.
  • The compiler has been ported to C.
  • The compiler no longer lowers IO, but instead processes derefs directly.
  • The compiler now handles booleans properly. It’s no longer lowering them to floats.
  • David Airlie has contributed lowering of glShadeModel(GL_FLAT) to flat interpolation qualifiers. This still doesn’t give us the right result, because Vulkan only supports the first vertex as the provoking vertex, and OpenGL defaults to the last one. To resolve this in a reasonable way, we need to inject a geometry shader that reorders the vertices, but this hasn’t been implemented yet. You can read more about this in this ticket
  • … and a boat-load of smaller fixes. There’s just too many to mention, really.

In addition to this, there’s been a pretty significant rewrite, changing the overall design of Zink. The reason for this was that I made some early design-mistakes, and after having piled a bit too many features on top of this, I decided that it would be better to get the fundamentals right first.

Sadly, not all features have been brought forward since the rewrite, so we’re currently back to OpenGL 2.1 support. Fixing this is on my list of things I want to do, but I suspect that cleaning things up and upstreaming will take precedence over OpenGL 3.0 support.

And just to give you something neat to look at, here’s Blender running using Zink:

Blender on Zink

Khronos Vulkan BoF at SIGGRAPH 2019

Khronos has been nice enough to give me a slot in their Vulkan Sessions at the Khronos BoFs during SIGGRAPH 2019!

The talk will be a slightly less tech-heavy introduction to Zink, what it does and what the future holds. It will focus more on the motivation and use cases than the underlying technical difficulties compared to my previous talks.

So, if you’re in the area please drop by! A conference pass is not required to attend the BoF, as it’s not part of the official SIGGRAPH program.

by Erik Faye-Lund at July 25, 2019 07:38 PM

July 11, 2019

Helen Koike

How to start contributing to Open Source

Collabora is participating in the Linux Developer Conference Brazil and this post is written in Portuguese to serve as a guide for the attendees to learn more about how to start contributing to Open Source Software.

Many people think that to start contributing to a FOSS (Free and Open Source Software) project you have to know how to code. That is a myth: what people really need is to understand how the project is structured as a community, and often you don't even need to write code to contribute, since reporting bugs, running tests, contributing to the interface design, reviewing and archiving obsolete bug reports, translating the software, or helping to organize teams or conferences are all very welcome contributions.

Building the software from source

The project usually has a web page with plenty of information, including instructions on how to download the source code. Most projects use some version control system; Git is currently the most popular, but it may also be Svn, Cvs, Mercurial or others. Go to the website of the project you are interested in, check which system is used, and familiarize yourself with the necessary tools.

Every project is different, but you will probably find some of the following files at the root of the project:

  • README (txt): Contains an initial overview of the project. Start here, since this file usually has information on how to build and install the software from source.
  • LICENSE or COPYING (txt): Has information about the license under which the project is distributed.
  • MAINTAINERS (txt): Describes which people are responsible for which part of the project.
  • CONTRIBUTING (txt): Gives more information on how to participate in the community and contribute.
  • Documents or docs (directory): Contains assorted documentation about the project, covering both usage and technical details.

When we refer to the "mainline" or "upstream" project, we mean the official project, where development of new features happens and which contains the most recent changes. In general, when you make a change (patch) to the code, it is only considered official after it lands in the mainline version. For example, the kernel that ships with a Linux distribution is not the mainline one: although it is based on a specific version of the mainline Linux project, the distribution community usually applies several modifications, both to the source code and to the kernel configuration, to meet the specific needs of that community. Therefore, when you find a bug, it is important to test the mainline code to see whether it has already been fixed or whether it only affects your distribution's version.

How to ask for help

When asking for help, keep in mind that most people in the community are volunteers just like you and are not obliged to attend to your demands, so be polite and check whether the project defines some kind of code of conduct. But don't be afraid to ask; show that you did a quick investigation of your own first, as that signals you are putting in the effort, and people generally like to encourage newcomers who show interest. For example:

"Hi, I'm new to the project and I'd like to understand X. I found the article Y but it doesn't seem to explain what I want to know. Could someone explain it to me or point me to where I can find this information?"

"Hi, I'm trying to understand how the code X works. It looks to me like it performs task Y, but I'm not sure. Is there any documentation about this? I searched but couldn't find any. I'd appreciate it if someone could help me."

If you are not sure you are asking on the right mailing list or in the right IRC channel, ask where it would be more appropriate to post your question. If you don't get an answer, don't assume you are being ignored because you are new; people are busy, so wait a while (a few hours on IRC or a week on email) and ask the question again.

Where to ask for help

IRC: Many projects have an IRC chat channel where the community coordinates and helps each other. Check whether your project has a channel, download an IRC client, connect to the server and join the channel. For example, on the Freenode server you can find the channels #freebsd, #ubuntu, #debian, #python and #docker. There are also more specific channels; Debian, for instance, organizes itself in teams, so on the OFTC server you can find the channels #debian-cloud, #debian-mirrors, #debian-ftp and others. Projects often have channels specifically for newcomers, such as #kernelnewbies on OFTC.

A guide to setting up IRC: https://fedoramagazine.org/beginners-guide-irc/

Mailing lists / forums: Check whether the project has a mailing list or discussion forum; the kernel, for example, has one list per subsystem. Find the appropriate list and subscribe. Many lists provide archives of past emails, which are useful when you are looking for a topic that has already been discussed. One tip here is to bottom-post (reply below or inline with the quoted text), which is the style used by most projects. If you don't get an answer within a week or two, check whether you sent your question to the most appropriate mailing list; sometimes people are simply busy, and I usually reply to the same email thread with the word "ping" to remind people to answer.

Discussions on other systems: some projects use GitHub directly for questions and discussions; check whether the project uses a specific system for discussions and take part.

Pull requests

A pull request is when you ask for your changes to be included in the mainline. Each project has its own way of submitting code changes (patches): in the Linux kernel, for example, you must send patches in the body of an email in the git-format-patch format; in FreeBSD you must attach the patch in unified diff format to the bug tracking system; and some other projects accept pull requests through GitHub. Check with your project how you should send patches to the community.

Community structure

Each community organizes itself in a different way; these are the different roles we can find within a community:

  • Author: the person who started the project
  • Committer: someone who has commit access to the mainline
  • Maintainer: the person responsible for reviewing and applying patches to some part of the project, or to the whole project
  • Contributors: people who help the project in various ways
  • Team: a subgroup of contributors who handle some specific task in the project, possibly even playing the role of a maintainer
  • Users

It is important to know the community structure so you know whom to ask questions, request reviews from, or send contributions to, reaching the team or group of people working on the related area. On mailing lists or IRC channels with a very generic scope it will be harder to find someone to review and apply a patch, or to answer a very specific question about some topic.

Examples of how some communities work

Debian:

The community is organized in a very democratic way: the project leader is elected in an annual vote, the work is divided among teams (e.g. the mirrors team, the DSA team for infrastructure, the release team that coordinates the release of the next version), and each package in Debian can have a specific maintainer or a team as its owner. So when you find a bug in a given package, check who is responsible for it, get in touch, and send your patches to the right person, team or mailing list.

Linux Kernel:

The project is maintained in Git, the only committer to the mainline is Linus Torvalds, and the project is divided into several subsystems. Each subsystem has a maintainer whom Linus Torvalds trusts and whose pull requests he accepts. How development is organized varies a lot from subsystem to subsystem and each one has its own rules; some subsystems have co-maintainers and others don't, and each subsystem normally has an IRC channel, a mailing list and documentation.

Digging into the code

Most people start by contributing something as simple as fixing a spelling mistake. That simple contribution teaches you the complete workflow of working with the community, but finding technical problems to solve is not always easy and requires a deeper knowledge of the project.

As you dig into the code, find out which debugging methods the project uses; these techniques will help you understand the code better and describe your problem in more detail to other people. Find out where you can see the error logs and how to add a log message to the code, and see whether you can step through the project with tools such as GDB or Python Trace. Some projects already ship tests; also check whether the community uses any external testing tool, and learn how to reproduce the tests and debug the code.

Finding a problem to solve

If you have run into something misbehaving in the project you are interested in, start there: check whether someone has already reported the bug on a mailing list, in a forum or in the bug tracking system itself, get in touch with the people involved and ask for more information. If you don't know where to start looking in the code, ask on the mailing lists or IRC channels; people will usually point you roughly in the right direction, and from there you can start investigating the problem. Report the bug to the community so they know it is already being investigated and can contact you to work together, avoiding duplicated effort.

If you haven't had the "luck" of finding a bug yourself, many projects already have a list of known bugs just waiting for someone to adopt them. Find out where that list lives, look into a bug you can reproduce, and don't be afraid to ask questions.

Depending on the project, taking the steps above can be very complicated and may require a lot of prior knowledge just to understand a bug. In the Linux kernel, for example, such a list of known problems barely exists, and the ones that do exist only contain problems that are hard for a beginner. What I suggest in that case is to change the approach: instead of trying to find a problem, study the code. Once you are familiar enough with it you will probably notice that the code is not perfect and spot several points that could be improved. One tip is to take some piece of code (a function, class, module or driver of the project) and try to rewrite it from scratch, using the original code only as a reference, asking the community questions about the parts you don't understand. The knowledge gained in this exercise will give you a better view of the code and of the internal APIs, expose possible problems, and also integrate you better into the community.

Paid internships with mentoring

A great way to start contributing to FOSS is through a guided internship. Some companies and FOSS foundations fund remote internship programs of roughly 3 months, where the mentor is usually a volunteer who proposes a specific task within the project. That way you already have a direction for your contributions, someone you can ask questions of (and whose role it actually is to answer you and to follow your progress regularly), and you get paid for it.

Google Summer of Code (GSoC): A remote internship in a FOSS project, paid by Google, running for 3 months from May to July and aimed at students. Check which projects are taking part; if one interests you, look at the proposals made by the volunteer mentors and at the selection process, since there are usually some tasks you need to complete as part of your application.

Outreachy: Organized by the Software Freedom Conservancy, similar to GSoC but aimed at groups under-represented in the community. You don't need to be a student, and it happens twice a year (May to June, and December to February).

Endless Vacation of Code (EVoC): The X.org Foundation has its own program for university students who want to start contributing. An EVoC can start in any month of the year.

Conferences

Many FOSS projects organize conferences to bring the community together and discuss current problems collaboratively. Going to conferences is a great way to get familiar with a project and to meet in person the people you interact with online. Check which conferences the project you are interested in holds, or which are the main conferences attended by the people you interact with.

Financial assistance for conferences

The problem is that most of these conferences take place outside Brazil and travelling gets very expensive, especially for students. Fortunately, many of these conferences offer travel grants: the Linux Foundation, for example, provides a form to request help with plane tickets and hotel, there is also support for under-represented groups to encourage diversity in the community, and sometimes the project itself has a fund for grants. Debian, for example, may pay for your trip, especially if you are already helping the community, but even newcomers can get it.

Another way to get financial assistance is to volunteer to help organize the conference; send an email to the organizing team and ask whether that is a possibility.

I hope these tips help. If you have any questions, get in touch or leave a comment.

by Helen M. Koike Fornazier (noreply@blogger.com) at July 11, 2019 02:01 AM

June 26, 2019

Alyssa Rosenzweig

GNOME Meets Panfrost

In my last Panfrost blog, I announced my internship goal: improve Panfrost to run GNOME3. GNOME is a popular Linux desktop making heavy use of OpenGL; to use GNOME with only free and open-source software on a machine with Mali graphics, Panfrost is necessary.

Two months ahead-of-schedule, here I am, drafting this blog post from GNOME on my laptop running Panfrost!

A tiled architecture

Bring-up of GNOME required improving the driver’s robustness and performance, focused on Mali’s tiled architecture. Typically found in mobile devices, tiling GPU architectures divide the screen into many small tiles, like a kitchen floor, rendering each tile separately. This allows for unique optimizations but also poses unique challenges.

One natural question is: how big should tiles be? If the tiles are too big, there’s no point to tiling, but if the tiles are too small, the GPU will repeat unnecessary work. Mali offers a hybrid answer: allow lots of different sizes! Mali’s technique of “hierarchical tiling” allows the GPU to use tiles as small as 16x16 pixels all the way up to 2048x2048 pixels. This “sliding scale” allows different types of content to be optimized in different ways. The tiling needs of a 3D game like SuperTuxKart are different from those of a user interface like GNOME Shell, so this technique gets us the best of both worlds!

Although primarily handled in hardware, hierarchical tiling is configured by the driver; I researched this configuration mechanism in order to understand it and improve our configuration with respect to performance and memory usage.

Tiled architectures additionally present an optimization opportunity: if the driver can figure out a priori which 16x16 tiles will definitely not change, those tiles can be culled from rendering entirely, saving both read and write bandwidth. As a conceptual example, if the GPU composites your entire desktop while you’re writing an email, there’s no need to re-render your web browser in the other window, since that hasn’t changed. I implemented an initial version of this optimization in Panfrost, accumulating the scissor state across draws within a frame, rendering only to the largest bounding box of the scissors. This optimization is particularly helpful for desktop composition, ideally improving performance on workloads like GNOME, Sway, and Weston.
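
As a purely illustrative sketch (not actual Panfrost code), the accumulation amounts to growing a single bounding box over the scissors of all draws in a frame and then walking only the tiles that box touches:

def frame_bounding_box(scissor_rects):
    # scissor_rects: iterable of (min_x, min_y, max_x, max_y), one per draw.
    bbox = None
    for min_x, min_y, max_x, max_y in scissor_rects:
        if bbox is None:
            bbox = [min_x, min_y, max_x, max_y]
        else:
            bbox[0] = min(bbox[0], min_x)
            bbox[1] = min(bbox[1], min_y)
            bbox[2] = max(bbox[2], max_x)
            bbox[3] = max(bbox[3], max_y)
    return bbox

def tiles_to_render(bbox, tile_size=16):
    # Yield the (x, y) indices of the tiles touched by the bounding box;
    # a frame with no draws (bbox is None) needs no tiles rendered at all.
    if bbox is None:
        return
    min_x, min_y, max_x, max_y = bbox
    for ty in range(min_y // tile_size, max_y // tile_size + 1):
        for tx in range(min_x // tile_size, max_x // tile_size + 1):
            yield (tx, ty)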

…Of course, theory aside, mostly what GNOME needed was a good, old-fashioned bugfixing spree, because the answer is always under your nose. Turns out what really broke the desktop was a trivial bug in the viewport specification code. Alas.

Scoreboarding

Looking forward to sophisticated workloads as this open driver matures, I researched job “scoreboarding”. For some background, the Mali hardware divides a frame into many small “jobs”. For instance, a “vertex job” executes a vertex shader; a “tiler job” executes tiling (sorting geometry into tiles at varying hierarchy levels). Many of these jobs have to execute in a specific order; for instance, geometry has to be output by a vertex job before a tiler job can read that geometry. Previously, these relationships were hard-coded into the driver, which was okay for simple workloads but does not scale well.

I have since replaced this code with an elegant dependency management system, based on the hardware’s scoreboarding. Instead of hard-coding relationships, the driver can now specify high level dependencies, and a generic algorithm (based on topological sorting) works out the order of submission and scoreboard flags necessary to actualize the given requirements. The new scoreboarding implementation has enabled new features, like rasterizer discard, to be implemented with ease.

With these improvements and more, several new features have landed in the driver, fixing hundreds of failing dEQP tests since my last blog post, bringing us nearer to conformance on OpenGL ES 2.0 and beyond.

Originally posted on Collabora’s blog

June 26, 2019 04:00 AM

June 05, 2019

Alyssa Rosenzweig

Joining Collabora for a summer of Panfrost

Hello, I’m Alyssa Rosenzweig, a student, the lead developer of the open-source Panfrost graphics driver… and now a Collaboran!

Years ago, I joined the open-source community with a passion and a mission: to enable equal access to high-quality computing via open-source software. With this mission, I co-founded Panfrost, aiming to create an open-source driver for the Mali GPU. Before Panfrost, users of Mali GPUs required a proprietary blob, restricting their ability to use their machines as they saw fit. Some users were unable to run Linux, their operating system of choice, with the display system of their choosing, simply because there were no blobs available for their particular configuration. Others wished to use an upstream kernel; yet others held a deep philosophical belief in free and open-source software. To each user’s driver problem, Panfrost seeks to provide a solution.

Days ago, I joined Collabora with the same passion and the same mission. Collabora was founded on an “open first” model, sharing my personal open source conviction. Collabora’s long-term vision is to let open-source software blossom throughout computing, fulfilling my own dream of an open-source utopia.

With respect to graphics, Collabora has shared my concerns. After all, we’re all on “Team Open Source” together! Collabora’s partners make awesome technology, often containing a Mali GPU, and they need equally awesome graphics drivers to power their products and empower their users. Our partners and our users asked, and Panfrost answered.

At Collabora, I am now a full-time Software Engineering Intern, continuing throughout the summer to work on Panfrost. I’m working alongside other veteran Panfrost contributors like Collaboran Tomeu Vizoso, united with open-source community members like Ryan Houdek. My focus will be improving Panfrost’s OpenGL ES 2.0 userspace, to deliver a better experience to Panfrost users. By the end of the summer, we aim to bring the driver to near conformance, to close any performance gaps, and through this work, to get GNOME Shell working fluidly on supported Mali hardware with only upstream, open-source software!

Supporting GNOME in Panfrost is a task of epic proportions, a project dream since day #1, yet ever distant in the horizon. But at Collabora, we’re always up for the challenge.

Originally posted on Collabora’s blog

June 05, 2019 04:00 AM

April 20, 2019

Alexandros Frantzis

Dynamic programming for fun and profit

Imagine that you decide to make an investment. You have 1720 currency units (CU) at your disposal and the investment options that are available are:

Option#   min amount   max amount   interest   sign up bonus
1         50           99           1%         0.5
2         100          299          1.1%       1
3         300          499          1.6%       2.2
4         500          999          1.8%       2.5
5         1000         1999         2%         4

You may use multiple investment options, each one possibly multiple times. Each option accepts a minimum and maximum amount of CUs, and provides an interest on that amount plus an additional bonus amount which does not depend on the invested amount. How will you invest your money? Take your time and continue reading when ready...

Let's see how you did! If you invested all 1720 CU in option 5 then... you chose poorly. Your profit is going to be 38.4 CU — not bad, but you can do better. If you invested 1020 in option 5 and 700 in option 4, that's a bit better at 39.5 CU, but still not optimal. What about 1020 in option 5, 500 in option 4 and 200 in option 2? That's actually a bit worse at 39.1 CU! I'll spare you the agony and reveal the best solution: 1020 in option 5, 300 in option 3 twice, and 100 in option 2, giving a profit of 40.5 CU. This is about 5.5% better than the straightforward solution of assigning everything to option 5.

As this short exercise has hopefully exhibited, finding the optimal portfolio of options is far from trivial. Is there a way to programmatically find the optimal solution?

The road to dynamic programming

First, let's define what we are trying to find. A portfolio is a multiset of (o, n) pairs, each pair representing that we are assigning n CU to option o. For a portfolio to be acceptable, all the pairs need to be valid and the total sum of all pairs must not exceed the total amount of money we want to use. We are trying to find the portfolio which provides the maximum profit.
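
Since the profit function drives everything that follows, it helps to state it explicitly (the problem statement gives it in prose): the profit of a single pair (o, k) is the interest earned on the k invested CU plus the option's sign-up bonus, and a portfolio's profit is the sum over its pairs:

\mathit{Profit}((o, k)) = \mathit{interest}_o \cdot k + \mathit{bonus}_o

For example, assigning all 1720 CU to option 5 yields 0.02 * 1720 + 4 = 38.4 CU, matching the figure quoted above.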

Brute force

To understand the solution space a bit more let's devise a way to enumerate all the portfolios, which is in fact not trivial.

One way to think about this is that to produce a portfolio we first have to split our total amount in some way, and assign an option to each element in the split. Each such split is formally called a partition of n. The number of partitions of a number n is given by the partition function p(n), which grows spectacularly quickly. For each such partition we also need to factor in all the different acceptable option permutations.

Another way to think about this, which turns out to be more helpful, is with a recursive, constructive view of the problem. Let P_n be the set of portfolios for n total CU. From this set, the subset of portfolios that use option o with k CU consists of all portfolios with n - k total CU combined with the option (o, k). Expressing this with set notation (⊎ is the multiset sum operator, which in our case is used to add items to the multiset) we have:

P_n(o,k) = \{\, Q \uplus \{(o,k)\} \mid Q \in P_{n-k} \,\}

The whole set P_n is the union of all P_n(o,k) sets for all valid combinations of o and k:

O_k = \{\, o \in \text{Options} \mid o \text{ accepts } k \,\}

P_n = \bigcup_{1 \le k \le n} \, \bigcup_{o \in O_k} P_n(o,k)

Using this recursive set description of the solution space we can write a program to enumerate all the solutions for a particular amount and set of options. This description provides a straightforward, albeit rather inefficient, way to solve this problem: by brute force — evaluate all solutions and use the best. In pseudocode this would look like:

solve_brute_force(n):
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                for s in (solve_brute_force(n-i) ∪ {∅}):
                    solutions = solutions ∪ (s ⊎ {(o,i)})

    return solutions

solve_driver(n):
    return argmax(solve_brute_force(n), profit)

Optimal substructure

Let's see if this problem has any interesting properties we can take advantage of to find the optimal solution more efficiently.

If we look at P_n(o,k) we see that we are combining (o,k) with all portfolios for n - k, and we ultimately check all these combinations to find the best solution. This is wasteful, however, since the choice of (o,k) doesn't affect the profit of any of the members of P_{n-k}. To put it more formally, the profit function is an additive multiset function:

\mathit{Profit}(Q \uplus \{(o,k)\}) = \mathit{Profit}(Q) + \mathit{Profit}((o,k)), \quad \forall Q \in P_{n-k}

We can thus get the best total value by using one of the members of the P_n(o,k) set with the highest profit. Any other choice from that set would yield a lower total profit. We have thus shown that our problem exhibits the optimal substructure property: an optimal solution includes optimal solutions to subproblems.

Armed with this revelation we can now produce a much more efficient formulation:

P_n(o,k) = Q \uplus \{(o,k)\}, \quad Q \text{ is any one element of } \arg\max_{S \in P_{n-k}} \mathit{Profit}(S)

This leads us to the following pseudocode:

solve_optimal_substructure(n):
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪
                            (solve_optimal_substructure(n-i) ⊎
                             {(o,i)})

    return argmax(solutions, profit)

solve_driver(n):
    return solve_optimal_substructure(n)

Overlapping subproblems

Another interesting thing to note is that to solve for n we need to solve for n-1, ..., 1. Similarly, to solve for n-1 we need to solve for n-2, ..., 1, and so on. We are solving the same problems over and over again! Our problem thus exhibits the property of overlapping subproblems. We can take advantage of this property by storing results in a cache and reusing them instead of recalculating them.

Updating our original pseudocode to use a cache we have:

solve_overlapping_subproblems(n):
    if n in CACHE: return CACHE[n]

    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                for s in (solve_overlapping_subproblems(n-i) ∪ {∅}):
                    solutions = solutions ∪ (s ⊎ {(o,i)})

    CACHE[n] = solutions
    return CACHE[n]

solve_driver(n):
    return argmax(solve_overlapping_subproblems(n), profit)

Dynamic programming

If we combine both optimizations, taking advantage of both the optimal substructure property and the overlapping subproblems property, we reach the dynamic programming solution:

solve_dp_recursive(n):
    if n in CACHE: return CACHE[n]
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪
                            (solve_dp_recursive(n-i) ⊎ {(o,i)})

    CACHE[n] = argmax(solutions, profit)
    return CACHE[n]

solve_driver(n):
    return solve_dp_recursive(n)

Alternatively, we can express this without recursion:

solve_dp_iterative(n):
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪ (CACHE[n-i] ⊎ {(o,i)})

    return argmax(solutions, profit)

solve_driver(n):
    for i in [1..n]:
        CACHE[i] = solve_dp_iterative(i)
    return CACHE[n]
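
For completeness, here is a small runnable Python version of the iterative dynamic programming solution; the Option tuple and the OPTIONS list mirror the table at the top of the post, while the function and variable names are just illustrative (this is a sketch, not the code from the repository linked at the end of the post):

from collections import namedtuple

Option = namedtuple("Option", "number min_amount max_amount interest bonus")

OPTIONS = [
    Option(1, 50, 99, 0.010, 0.5),
    Option(2, 100, 299, 0.011, 1.0),
    Option(3, 300, 499, 0.016, 2.2),
    Option(4, 500, 999, 0.018, 2.5),
    Option(5, 1000, 1999, 0.020, 4.0),
]

def best_portfolio(n, options=OPTIONS):
    # cache[i] holds (portfolio, profit) for an optimal portfolio of i CU,
    # where a portfolio is a list of (option, amount) pairs.
    cache = {0: ([], 0.0)}
    for total in range(1, n + 1):
        best, best_profit = [], 0.0       # leaving money uninvested is valid
        for amount in range(1, total + 1):
            for o in options:
                if o.min_amount <= amount <= o.max_amount:
                    sub, sub_profit = cache[total - amount]
                    candidate_profit = sub_profit + o.interest * amount + o.bonus
                    if candidate_profit > best_profit:
                        best = sub + [(o, amount)]
                        best_profit = candidate_profit
        cache[total] = (best, best_profit)
    return cache[n]

if __name__ == "__main__":
    portfolio, total_profit = best_portfolio(1720)
    for o, amount in portfolio:
        print(f"option {o.number}: {amount} CU")
    print(f"total profit: {total_profit:.1f} CU")

Running it for 1720 CU reproduces the 40.5 CU optimum from the introduction.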

Scaling down

Instances of this problem with a large n can still be prohibitively expensive to solve, even when using the dynamic programming approach I just described. Not all is lost, however, if we are willing to relax our requirement for a guaranteed optimal solution.

One approach to cutting down our solution time, while still producing a possibly good solution, is to scale down our problem. We do this by dividing the start, end and bonus values of all options, and also the target number we solve for, by a particular scale factor. Another way to view this is that scaling down by scale corresponds to performing the algorithms with a minimum step of value scale instead of value 1.

For example if we scaled down our original problem by 10 we would solve for 172 (= 1720 / 10) and use the following scaled down options:

Option#   min amount   max amount   interest   sign up bonus
1         5            9            1%         0.05
2         10           29           1.1%       0.1
3         30           49           1.6%       0.22
4         50           99           1.8%       0.25
5         100          199          2%         0.4

When we find the optimal solution we can scale it back up using the same scale factor.

The expectation is that the optimal solution for the scaled down version of our problem is going to be in the proximity of the optimal solution for the original version, and assuming that the Profit function is generally smooth, the profit is also going to be near the optimal one.

Getting greedy

Another approach to solving this problem faster is to explore whether there is greedy solution for it. Below I will describe an approach that works in many cases, but is not guaranteed to provide the optimal solution.

Taking inspiration from the greedy solution to the fractional knapsack problem, at each step we greedily select to put as many CUs as possible in the option with the best marginal profit. The key observation here is that in our case each option has two different marginal profits. The first one involves investing in a new instance of an option using its minimum acceptable amount. The second one involves investing more in an instance of an option we have already invested in. In the first case the sign up bonus kicks in and increases the marginal profit. In the second case the marginal profit is simply the interest.

For our original scenario we have:

Option#   min amount   max amount   min marginal profit   add marginal profit
1         50           99           2%                    1%
2         100          299          2.1%                  1.1%
3         300          499          2.3333%               1.6%
4         500          999          2.3%                  1.8%
5         1000         1999         2.4%                  2%

For a total amount of 1720 this method works flawlessly. We first select to add (o5, 1000) to our portfolio for a marginal profit of 2.4%. From the remaining 720 we add (o3, 300) for a marginal profit of 2.333%. From the remaining 420 we again choose (o3, 300). We now have 120 left, for which we choose (o2, 100), and the final 20 we add to the (o5, 1000) instance we already have.

Unfortunately, this method doesn't find the optimal solution in all cases. Take for example a total amount of 500. The greedy algorithm chooses (o3, 300), (o2, 100), (o2, 100) for a total profit of 11.2. Alas, the optimal solution is simply (o4, 500) for a total profit of 11.5.

Perhaps there is some other greedy approach that provides optimal solutions in all cases. Still, the described greedy approach is useful if we want to quickly get a possibly good solution. We could also try both the greedy approach and the scaling approach to raise our confidence level in the generated solution.
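
To make the move selection concrete, here is a Python sketch of the greedy strategy described above; it reuses the Option tuple and OPTIONS list from the dynamic programming sketch earlier, and is again illustrative rather than code from the invest repository:

def greedy_portfolio(n, options=OPTIONS):
    portfolio = []                   # mutable [option, amount] instances
    remaining = n
    while remaining > 0:
        moves = []
        # Opening a new instance at the minimum amount earns the sign up
        # bonus, so its marginal profit is (interest * min + bonus) / min.
        for o in options:
            if remaining >= o.min_amount:
                rate = (o.interest * o.min_amount + o.bonus) / o.min_amount
                moves.append((rate, "open", o, o.min_amount))
        # Topping up an existing instance only earns the plain interest.
        for instance in portfolio:
            o, amount = instance
            room = min(o.max_amount - amount, remaining)
            if room > 0:
                moves.append((o.interest, "top-up", instance, room))
        if not moves:
            break                    # leftover CUs that no option can absorb
        rate, kind, target, amount = max(moves, key=lambda m: m[0])
        if kind == "open":
            portfolio.append([target, amount])
        else:
            target[1] += amount
        remaining -= amount
    return [(o, amount) for o, amount in portfolio]

For 1720 CU this reproduces the optimal split described above, while for 500 CU it yields the suboptimal 11.2 CU result.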

Can we do better?

We have mostly focused on solving our investment problem using a dynamic programming approach. Taking a step back, we may ask if this is indeed the best way to get an optimal solution. Can we approach this problem from a completely different angle to improve performance?

It turns out that the answer is yes! With some creativity our problem can be expressed as an integer linear program (ILP), for which we have algorithms that work quite efficiently in practice. I plan to present this approach in more detail in a future post.

An ounce of action...

If you want to try out the ideas mentioned in this post, I have created a program implementing all the algorithms and variations. You can find it at: https://github.com/afrantzis/invest.

Enjoy!

by Alexandros Frantzis at April 20, 2019 03:00 PM

April 16, 2019

Jeremy Whiting

Uninitialized member variables

Dear lazyweb,

In the past week or so I've been bitten twice by failing to initialize member variables in a C++ class's constructor. So I went looking for compiler options, static analyzers, etc. to tell me when I fail to do this. So far I've found nothing that correctly reports to me that I forgot to add m_foobar initialization to my constructor. /Wall on MSVC, -Weffc++, cppcheck, etc. all fail me here. Isn't there something out there that will say "Jeremy, you dork, you forgot to initialize m_startCount as 0, you'll get garbage" (sometimes and only on M$ Windows, but still)?

by Jeremy Whiting (noreply@blogger.com) at April 16, 2019 03:55 PM

April 01, 2019

Alyssa Rosenzweig

Kodi and SuperTuxKart on Panfrost

Back in October, Panfrost ran some simple benchmarks, like glmark. Five months later, Panfrost has grown from running benchmarks to real-world apps, like Kodi, and 3D games like SuperTuxKart and Neverball.

Since the previous post, there have been major improvements across every part of the project, culminating in this milestone. On the kernel side, my co-contributors Tomeu Vizoso and Rob Herring have created a modern kernel driver, suitable for mainline inclusion. Panfrost now uses this upstream-friendly driver, rather than relying on a modified legacy kernel as in the past. The new kernel module is currently under review for mainline inclusion. You can read more about this progress on Tomeu’s blog.

Outside the kernel, however, the changes have been no less significant. Early development was constrained to our own project repositories, as the code was not yet ready for general users. In early February, thanks in part to the progress on the kernel space, we flew out of our nest, and Panfrost was merged into upstream Mesa, the central repository for free software graphics drivers. Development now occurs in-tree in Mesa.

We have continued decoding new aspects of the hardware and implementing support in the driver. A few miscellaneous additions include cube maps, gl_PointSize and gl_PointCoord, linear depth rendering, performance counters, and new shader instructions.

One area of particular improvement has been our understanding of the hardware formats (like “4-element vector of 32-bit floats” or “single 16-bit unsigned normalized integer”). In Panfrost’s early days, we knew magic numbers to distinguish a few of the most common formats, but the underlying meanings of the voodoo patterns were elusive. Further, the format bits for textures and attributes were not unified, further limiting the diversity of supported formats. However, Connor Abbott and I have since identified the underlying meaning of the format codes for textures, attributes, and framebuffers. This new understanding allows the magic numbers to be replaced by a streamlined format selection routine, mapping Gallium’s formats to the hardware’s and supporting the full spectrum of formats required for a conformant driver. Panfrost is now passing texture format tests for OpenGL ES 2.0.
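Purely as an illustration of that idea, and not the actual code or bit layout used by the driver, a format code built from the format’s structure instead of a per-format magic number could look something like this (every name and constant below is invented):

    #include <cstdint>
    #include <iostream>

    enum class NumericType : uint32_t { Unorm = 0, Snorm = 1, Float = 2 };

    // Hypothetical encoding: [ type : 4 | log2(bits per channel) : 4 | channels : 4 ].
    // The point is only that one routine replaces a table of opaque magic numbers.
    constexpr uint32_t hw_format(NumericType type, uint32_t bits, uint32_t channels) {
        uint32_t log2_bits = (bits == 8) ? 3 : (bits == 16) ? 4 : 5;
        return (static_cast<uint32_t>(type) << 8) | (log2_bits << 4) | channels;
    }

    int main() {
        // e.g. an RGBA8 unorm texture versus a 4-element vector of 32-bit floats
        std::cout << std::hex
                  << hw_format(NumericType::Unorm, 8, 4) << '\n'
                  << hw_format(NumericType::Float, 32, 4) << '\n';
    }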

From a performance standpoint, various optimizations have been added. In particular, a fast path likely relating to the “tiler” in the hardware was discovered. When this fast path is used, performance on geometry heavy scenes skyrockets. In one extreme demo (shading the Stanford bunny), performance more than tripled, and these gains trickle down to real-world games.

Features aside, one of the key issues with an early driver is brittleness and instability. Accordingly, to guarantee robustness, I now test with the drawElements Quality Program (dEQP), which includes comprehensive code correctness tests. Although we’re still a while away from conformance, I now systematically step through identified issues and resolve the bugs, translating to fixes across every aspect of the driver.

One real-world beneficiary of these fixes is the Kodi media center, which today works well using Panfrost to achieve a fluid interface on Midgard devices. For standalone installations of Kodi, there are experimental images featuring Kodi and Panfrost. To further improve fluidity, Kodi and Panfrost can even interoperate with video decoding acceleration, contingent on cooperative kernel drivers.

For users more inclined to gaming, some 3D games are beginning to show signs of life with Panfrost. For instance, the classic (OpenGL ES 2.0) backend of the ever-popular kart racing game, SuperTuxKart, now renders with only minor glitches under Panfrost. Performance is playable on simple tracks, though we have many opportunities for optimization. To bring up this racing game, I added support for complex control flow in the compiler. Traditionally, control flow is discouraged in graphics, due to the architecture of desktop GPUs (thread “warps”). However, Midgard does not feature these traditional optimizations, negating the performance penalty for branching from control flow. The implementation required new bookkeeping in the compiler, as well as an investigation into long jumps due to the size of the game’s “uber-shader”. In total, this compiler improvement – paired with assorted bug fixes – allows SuperTuxKart to run.

Likewise, Neverball is playable (and fun!) with Panfrost, although there are rendering anomalies relating to the currently unimplemented legacy feature “point sprites”. In contrast to Kodi and SuperTuxKart, which make liberal use of custom shaders, Neverball is implemented with purely fixed-function desktop OpenGL. This poses an interesting challenge, as Midgard is designed specifically for embedded graphics; the blob does not support this desktop superset. But that’s no reason we can’t!

Like most modern free software OpenGL drivers, Panfrost is built atop the modular “Gallium” architecture. This architecture abstracts away interface details, like desktop versus embedded OpenGL, normalizing differences to allow drivers to focus on the hardware itself. This abstraction means that by implementing Panfrost as an embedded driver atop Gallium, we get a partial desktop OpenGL implementation “free”.

Of course, there is functionality in the desktop superset that does not exist in the embedded profile. While Gallium tries to paper over these differences, the driver is required to implement features like point sprites and alpha testing to expose the corresponding desktop functions. So, the bring-up of desktop OpenGL applications like Neverball has led me to implement some of this additional functionality. Translating the “alpha test” to a conditional discard instruction in the fragment shader works. Similarly, translating “point sprites” to the modern equivalent, gl_PointCoord, is planned.

Interestingly, the hardware does support some functionality only available through the full desktop profile. It is unknown how many “hidden features” of this type are supported; as the blob does not appear to use them, these features were discovered purely by accident on our part. For instance, in addition to the familiar set of “points, lines, and triangles”, Midgard can natively render quadrilaterals and polygons. The existence of this feature was suggested by the corresponding performance counters, and the driver-side mechanics were determined by manual bruteforce of the primitive selection bits. Nevertheless, now that these bonus features are understood, quads can be drawn from desktop applications without first translating to indexed triangles in software. Similarly, it appears that, in addition to the embedded standard of boolean occlusion queries, setting a chicken bit enables the hardware’s hidden support for precise occlusion counters, a useful desktop feature.

Going forward, although the implementation of OpenGL ES 2.0 is approaching feature-completeness, we will continue to polish the driver, guided by dEQP. Orthogonal to conformance, further optimization to improve performance and lower memory usage is on the roadmap.

It’s incredible to reflect back and realise that just one year ago, work had not even begun on writing a real OpenGL driver. Yet here we are today with an increasingly usable, exclusively free software, hardware-accelerated desktop with Mali Midgard graphics.

Frost on.

April 01, 2019 04:00 AM