yshui --log-level=trace

As the title suggests, this is a dump of my random thoughts. Well, that is the intention at least. I have just started so it is still pretty barren here.


License: All articles and materials on this site, unless otherwise specified, are published under CC BY 4.0. Icons by Font Awesome, font by Open Sans.

I found an 8-year-old Xorg bug

Let me set the right expectations first. The bug I found is actually not that complicated; it's very straightforward once you see what's going on. But I still think the process it took me to uncover it could be interesting. It's also kind of interesting that a simple bug stayed undiscovered for so long. I will speculate why that is later. Now let's start.

The big X server lock

To give you some background, I was working on picom, the X11 compositor, when I encountered this problem. picom uses an X request called GrabServer, which is essentially a giant lock that locks the entire X server.

Why do we need this? Well, that's because the X server is a terrible database, but that would take a long article to explain (let me know if you would like to read about that). To put it simply, picom needs to fetch the window tree from X. But there is no way to get the whole tree in one go, so we have to do this piece by piece. If the window tree keeps changing as we are fetching it, we will just get horribly confused. So we lock the server, then fetch the tree in peace.

And GrabServer is just the tool for that, quoting the X protocol specification:

[GrabServer] disables processing of requests and close-downs on all connections other than the one this request arrived on.

Cool, until I found out that ...

... It doesn't work

I have a habit of putting assertions everywhere in my code. This way, if something is not going where I expected it to go, I will know. I would hate for things to quietly keep going and only fail mysteriously much later.

And that is how I found out something isn't right - windows that we know exist, suddenly disappear while we are holding the X server lock. Basically when a window is created, we receive an event. After getting that event, we lock the X server, then ask it about the new window. And sometimes, the window is just not there. How could this happen if the server is locked by us?

The first thing I did was to check the protocol again. Did I somehow misunderstand it? Unlikely, as the protocol is pretty clear about what GrabServer does. OK, does picom have a bug then? Did we somehow forget to lock the server? Did we miss a window destroyed event? I checked everywhere, and didn't really find anything.

This seems to lead to a single possible conclusion ...

An Xorg bug?

It could be, though I didn't want to jump to conclusions that quickly. I want to at least figure out what was going on inside the X server when those windows were destroyed.

I could attach a debugger to the X server. However, debugging the X server pauses it, which would be a problem if I were debugging from inside that X session. Besides that, window destruction happens quite often, which makes manual debugging impractical. It's still possible with a remote ssh connection and gdb scripting, but it's inconvenient.

The other option is modifying the X server and adding printfs to print out logs when interesting things happen. That still feels like too much work.

Luckily, there is a better way to do this. It's called eBPF and uprobe. Essentially they let you run arbitrary code when your target program reaches certain points in its code, without modifying the program or disrupting its execution.

Yeah, we live in the future now.

So, I hooked into GrabServer, so I could see who is currently grabbing the server; then I hooked into window destruction to print a stack trace every time a window is destroyed. When everything was ready I set it off and collected the logs. At first there were a couple of false positives, because some applications do legitimately grab the server and destroy windows. But after a while, I saw something that stood out:

0x4755a0 DeleteWindow (window.c:1071)
0x46ef75 FreeClientResources (resource.c:1146) | FreeClientResources (resource.c:1117)
0x4450bc CloseDownClient (dispatch.c:3549)
0x5bfd12 ospoll_wait (ospoll.c:643)
0x5b8901 WaitForSomething (WaitFor.c:208)
0x445bb5 Dispatch (dispatch.c:492)
0x44a1bb dix_main (main.c:274)
0x729a77b6010e __libc_start_call_main (:0)

Aha, CloseDownClient! So the window is closed because a client disconnected? But I remember the protocol specification says

... disables processing of requests and close-downs ...

Oh yeah, this is indeed an Xorg bug! So what's going on here?

A simple bug

The Xorg server uses epoll to handle multiple client connections. When GrabServer is used, the server will stop listening for readability on all clients other than the one that grabbed the server. This is all well and good, except for connection errors. When an error happens, epoll will notify the server even if it is not listening for anything. The epoll_ctl(2) man page says:

EPOLLERR

Error condition happened on the associated file descriptor. This event is also reported for the write end of a pipe when the read end has been closed.

epoll_wait(2) will always report for this event; it is not necessary to set it in events when calling epoll_ctl().

Turns out, it's just a simple misuse of epoll. Checking the git logs shows this bug has been there for at least 8 years.
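
The epoll behavior is easy to reproduce outside of Xorg. Below is a minimal sketch, assuming Linux and a Unix socket pair standing in for a client connection; the function name events_after_peer_close is made up for this illustration and none of this is the X server's actual code. One end of the pair is registered with an empty event mask, the other end is closed, and epoll_wait still wakes up. With a socket pair the disconnect typically shows up as EPOLLHUP, which epoll_ctl(2) documents as always reported, the same way as EPOLLERR.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Registers one end of a socket pair with an EMPTY event mask, closes the
 * other end, and returns the events epoll reports for the registered end.
 * Returns 0 on setup failure or if nothing is reported within one second. */
unsigned int events_after_peer_close(void)
{
	int fds[2];
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
		return 0;
	}

	int epfd = epoll_create1(0);
	if (epfd < 0) {
		close(fds[0]);
		close(fds[1]);
		return 0;
	}

	/* events = 0: we are not asking for readability or anything else,
	 * which is roughly the state of an ignored client during a grab. */
	struct epoll_event ev = {.events = 0, .data = {.fd = fds[0]}};
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) != 0) {
		close(fds[0]);
		close(fds[1]);
		close(epfd);
		return 0;
	}

	/* The "client" goes away. */
	close(fds[1]);

	struct epoll_event out = {0};
	int n = epoll_wait(epfd, &out, 1, 1000);
	unsigned int result = (n == 1) ? out.events : 0;

	close(fds[0]);
	close(epfd);
	return result;
}
```

Checking the return value against EPOLLERR | EPOLLHUP shows the wakeup happened even though no events were requested. An event loop that wants to truly ignore a file descriptor for a while has to take it out of the epoll set (EPOLL_CTL_DEL), or remember the hangup and act on it later.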

So how does a simple bug like this slip under the radar for so long? Actually, I think I might have the answer for this.

You see, an X11 compositor sits in a very special niche in the system. Normal applications only care about their own windows most of the time, so they only need to synchronize within themselves. And for window managers, well, they manage windows. They have the authority to decide when a window should be destroyed (well, most of the time). So there is no race condition there either. Only the compositor needs to know about all windows, yet doesn't have a say in when they are closed. So it's in a unique position that makes using the big X server lock necessary.

Besides that, this problem rarely happens despite picom's heavy use of the lock. I was only able to trigger it by installing .NET Framework on Linux using Wine. (I will not explain why I was doing that.)

Conclusion

I actually don't have much more to say. Hopefully you found this little story interesting. I definitely recommend learning about eBPF and uprobe. They are amazing tools, and have a lot more uses beyond just debugging.


Additional note 1: Despite me claiming it is necessary to use the server lock in picom, there might be a way of updating the window tree reliably without it. I do want to get rid of the lock if I can, but I am still trying to figure it out.

Did GitHub Copilot really increase my productivity?

Translations: 🇯🇵日本語


I had free access to GitHub Copilot for about a year. I used it, got used to it, and slowly started to take it for granted, until one day it was taken away. I had to re-adapt to a life without Copilot, but it also gave me a chance to look back at how I used Copilot, and reflect - had Copilot actually been helpful to me?

Copilot definitely feels a little bit magical when it works. It's like it plucked code straight from my brain and put it on the screen for me to accept. Without it, I find myself getting grumpy a lot more often when I need to write boilerplate code - "Ugh, Copilot would have done it for me!", and now I have to type it all out myself. That being said, the answer to my question above is a very definite "no, I am more productive without it". Let me explain.

Disclaimer! This article only talks about my own personal experiences. As you will be able to see, the kind of code I ask Copilot to write is probably a little bit atypical. Still, if you are contemplating whether you should pay for Copilot, I hope this article can serve as a data point. Also, I want to acknowledge that generative AI is a hot-potato topic right now - Is it morally good? Is it infringing copyright? Is it fair that companies train their models on open source code then benefit from it? These are all very, very important questions. However, please allow me to put all that aside for this article, and talk about productivity only.

OK, let me give you some background first. For reasons you can probably guess, I do not use Copilot for my day job. I use it for my own projects only, and nowadays most of my free time is spent on a single project - picom, an X11 compositor, which I am a maintainer of. I am not sure how many people reading this will know what a "compositor" is. It really is a dying breed after all, given the fact X11 is pretty much at its end-of-life, and everyone is slowly but surely moving to Wayland. Yes, each of the major desktop environments comes with its own compositor, but if you want something that is not attached to any DE, picom is pretty much the only option left. Which is to say, it is a somewhat "one of a kind" project.

Of course, as is the case with any software project, you will be able to find many commonly seen components in picom: a logging system, string manipulation functions, sorting, etc. But how they all fit together in picom is pretty unique. As a consequence, large scale reasoning about the codebase with Copilot is out of the window. Since it has not seen a project like this during training, it's going to have a really hard time understanding what it's doing. Which means my usage of Copilot is mostly limited to writing boilerplate, repetitive code, etc. To give a concrete example, say you need to parse an escaped character:

if (pattern[offset] == '\\') {
	switch (pattern[offset + 1]) {
	case 't': *(output++) = '\t'; break;
	// ????
	}
}

If you put your cursor at the position indicated by ????, you can pretty reliably expect Copilot to write the rest of the code for you. Other examples include mapping enums to strings, writing glue functions that have a common pattern, etc. In other words, the most simple and boring stuff. Which is very good. See, I am someone who wants programming to be fun, and writing this boring, repetitive code is the least fun part of programming for me. I am more than delighted to have someone (or rather, something) take it away from me.
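
For reference, here is a sketch of what a filled-in version of the snippet above might look like. This is an illustration only, not picom's actual code: the function name unescape, the set of escapes handled, and the choice to pass unknown escapes through unchanged are all assumptions made for this example.

```c
#include <stddef.h>

/* Copies `pattern` into `output`, translating backslash escapes.
 * `output` must have room for strlen(pattern) + 1 bytes.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t unescape(const char *pattern, char *output)
{
	char *start = output;
	for (size_t offset = 0; pattern[offset] != '\0'; offset++) {
		if (pattern[offset] == '\\' && pattern[offset + 1] != '\0') {
			switch (pattern[offset + 1]) {
			case 't': *(output++) = '\t'; break;
			case 'n': *(output++) = '\n'; break;
			case 'r': *(output++) = '\r'; break;
			case '\\': *(output++) = '\\'; break;
			/* Unknown escape: keep the escaped character itself
			 * (an assumption, other parsers may reject it). */
			default: *(output++) = pattern[offset + 1]; break;
			}
			offset++; /* skip the character after the backslash */
		} else {
			/* Ordinary character, or a lone trailing backslash. */
			*(output++) = pattern[offset];
		}
	}
	*output = '\0';
	return (size_t)(output - start);
}
```

Every case after the first is exactly the sort of line a completion tool can be expected to produce from the 't' case alone, which is what makes this kind of code a good fit for it.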

So, what is wrong then? Why did I say I am more productive without Copilot? Well, that's because Copilot has two glaring problems:

1. Copilot is unpredictable

Copilot can be really really helpful when it gets things right. However, it's really difficult to predict what it will get right, and what it won't. After a year of working with Copilot, I would say I am better at that than when I first started using it, but I have yet to fully grasp all the intricacies. It is easy to fall into the trap of anthropomorphising Copilot, and trying to gauge its ability like you would a human's. For instance, you might think, "Hmm, it was able to write that function based on my comments, so it must be able to write this too". But you are more than likely to be proven wrong by the chunk of gibberish Copilot throws at you. This is because artificial intelligence is very much unlike human intelligence. The intuition you've developed through a lifetime's interaction with other humans is not going to work with an AI. Which means, short of letting Copilot actually try, there is oftentimes no surefire way to know whether it's going to work or not. And this problem is compounded by the other big problem of Copilot:

2. Copilot is slooooow

clangd, my C language server of choice, is very fast. It's faster than I can type, which means practically speaking, its suggestions are instant. Even when the suggestions are unhelpful, it costs me nothing. I don't have to pause, or wait, so my flow is uninterrupted. Compared to that, Copilot is much much slower. I would wait at least 2 to 3 seconds to get any suggestion from Copilot. If Copilot decided, for whatever reason, to write a large chunk of code, it would take a lot longer. And in many instances I would wait all those seconds only to see Copilot spit out unusable code. And I would have to decide whether to refine the instructions in comments and try again, or partially accept the suggestion and do the rest myself. Even though this doesn't happen that often (after you have gotten to know Copilot a bit better), much time is wasted in the back-and-forth.


So yeah, that's pretty much all I have to say. At least at this very moment, I do not think Copilot will improve my productivity, so I definitely wouldn't be paying for it. If GitHub's plan was to give me a year's free access to Copilot to get me addicted, then their plot has conclusively failed. But that being said, if Copilot were a little bit smarter, or several times faster than it currently is, maybe the scale would tip the other way.

Hmm, should I be scared?

I want a different Nix

I have been daily driving NixOS for about six months, and it has been great. I don't think I'll ever switch to a different distro again (don't quote me on this). I'm sure you've already heard why Nix is great many times, so I'll try not to parrot my fellow Nix enthusiasts. (And if you have not, it's not hard to find such an article.)

Instead, I am here to complain about one thing I dislike strongly about Nix: it does not support dynamic dependencies.

To see what I mean by this, let me give you some background first. With Nix, a package's dependencies are fixed when it is built. Say you have this derivation (what Nix calls a package):

package = mkDerivation {
   # ...
   buildInputs = [ dep1 dep2 ];
};

Then after package is built, it will contain hard-coded references to dep1 and dep2, which cannot be changed. If either of the dependencies changes, e.g. because of a version update, you will get a different package as output. This can be great if you want your packages to be absolutely deterministic and reproducible. But, as an average Linux user, this has caused me much pain.

Because of all the darn rebuilds!

In the example above, if anything depends on package, it will be rebuilt if either of package's dependencies changes, because package is an entirely different package now. And everything that transitively depends on it will get rebuilt too! Which means if you want to install a slight variant of a package, you could be getting yourself into a rebuild hell. And because of your change, none of the packages that need rebuilding can be found in NixOS' binary cache.
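
To make the cascade concrete, here is a toy model in C. It is a sketch under loud assumptions: the struct pkg type and the pkg_identity function are invented for this illustration, and the FNV-1a hash stands in for whatever Nix really hashes. The only property it shares with Nix is the one that matters here: a package's identity is computed from its own description plus the identities of all of its inputs, recursively.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A toy package: a name, a version, and pointers to its build inputs. */
struct pkg {
	const char *name;
	const char *version;
	const struct pkg *deps[4];
	size_t ndeps;
};

/* FNV-1a, chosen only because it is short. This is NOT what Nix uses. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;
	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* The identity covers the package's own fields AND the identity of every
 * input, so a change anywhere below propagates all the way up. */
uint64_t pkg_identity(const struct pkg *p)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	/* Hash the strings including their NUL so fields stay separated. */
	h = fnv1a(h, p->name, strlen(p->name) + 1);
	h = fnv1a(h, p->version, strlen(p->version) + 1);
	for (size_t i = 0; i < p->ndeps; i++) {
		uint64_t dep = pkg_identity(p->deps[i]);
		h = fnv1a(h, &dep, sizeof(dep));
	}
	return h;
}
```

Build a small graph where app depends on package, and package depends on dep1 and dep2. Bumping dep1's version changes the identity of package and of app, even though neither of their own descriptions changed, while dep2 keeps its identity. Anything whose identity changed is a cache miss and has to be rebuilt.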

Last week I spent more than an hour just to enable debug info for xorg.xorgserver, because Nix had to recompile the entirety of Qt, webkit2gtk, along with 100 other packages. And last time I tried to use a different version of xz (you might be able to guess why), Nix wanted to recompile literally everything, because xz is one of the bootstrap packages, so basically every other package depends on it.

And this is pretty hard for NixOS developers too. Changes to certain packages trigger huge rebuilds, which are so computationally intensive that NixOS developers choose to lump them together into big pull requests. And they often take weeks to be validated and merged. Even urgent security fixes have to get through the same pipeline.

This problem is intrinsic to Nix, so I don't think it can be solved. I just wish there were an alternative to Nix that does most of what Nix does but allows dynamic dependencies. If you know such a thing exists, please please let me know.