Friday, March 24, 2017

On Linux vDSO and clock_gettime sometimes being slow

Like the previous post on this somewhat dormant blog, I want to share an oddity I discovered that no search engine could really find for me - even though once I found what the problem was, it turns out I was by no means the first person to discover this.

Some system calls that are used extremely frequently in Linux can be sped up by a mechanism called vDSO: a virtual dynamically linked shared object. In this way, the kernel can publish selected functions that run straight in userspace. This means a regular program dynamically links in bits of kernel supplied code, which in turn means that there is no overhead to "jump into the kernel" to execute code. All good.

One way you notice your system call has received the vDSO treatment is that "strace" and friends no longer see it, since there actually is no system call anymore.

Of specific interest are time related calls, like gettimeofday and clock_gettime. Many programs make a ton of these calls, and little can be done to prevent it. You might want to cache the current time perhaps, but to do so, you'd need to know the time. So quite a lot of code relies on time related system calls being really, really fast.

This explains why the recent discovery that the AWS platform does not vDSO gettimeofday was such a big deal.

Within PowerDNS software (dnsdist), we use clock_gettime() in hopes of getting the kind of timer we want, and also one that is fast and cheap for the kernel to provide. While doing "million QPS" scale benchmarking of dnsdist today, we did a strace to find out what dnsdist was doing, and lo, within there we found millions and millions of system calls to clock_gettime(). Help!

My first thought was that the platform we were on might perhaps not actually support clock_gettime as vDSO. To figure out what is actually in the kernel supplied vDSO, I used a program called dump-vdso.c that can be found strewn across the web. This emits the library on stdout, and we can then run the regular objdump tool on it to get:

$ ./dump-vdso > vdso.so
$ objdump -T vdso.so

vdso.so:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:

0000000000000418 l    d  .rodata 0000000000000000              .rodata
0000000000000a20  w   DF .text 0000000000000305  LINUX_2.6   clock_gettime
0000000000000000 g    DO *ABS* 0000000000000000  LINUX_2.6   LINUX_2.6
0000000000000d30 g    DF .text 00000000000001b1  LINUX_2.6   __vdso_gettimeofday
0000000000000f10 g    DF .text 0000000000000029  LINUX_2.6   __vdso_getcpu
0000000000000d30  w   DF .text 00000000000001b1  LINUX_2.6   gettimeofday
0000000000000ef0  w   DF .text 0000000000000015  LINUX_2.6   time
0000000000000f10  w   DF .text 0000000000000029  LINUX_2.6   getcpu
0000000000000a20 g    DF .text 0000000000000305  LINUX_2.6   __vdso_clock_gettime
0000000000000ef0 g    DF .text 0000000000000015  LINUX_2.6   __vdso_time

From this we see that clock_gettime is in fact in there. So why was it not getting used? I donned the protective gear and the spelunking equipment and entered the caves of glibc, where I found several nested files, each #including a file from a parent directory, in an impressive attempt to abstract out per CPU, per OS and C library logic. I stared at that code for what felt like a long time, but it appeared to check lots of things, to eventually always end up calling __vdso_clock_gettime(). Weird.

I then headed to __vdso_clock_gettime() in the Linux kernel where things finally became clear. It turns out the vdso code ITSELF will generate an actual system call for many timers you can request. In fact, this happens for all cases except CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE (as of Linux 3.13 up to 4.11-rc3).

So that solved the mystery: the vDSO stuff was working, but it was itself causing an old fashioned system call. Perhaps the other timers are too difficult (or perhaps even impossible) to supply from the userspace context.
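You can see this effect for yourself by timing a burst of clock_gettime() calls per clock id. Here is a rough Linux-specific sketch (burst() is a made-up helper, not anything from the PowerDNS code): clocks served from the vDSO should come out far cheaper per call than the ones that fall through to a real system call.

```cpp
#include <time.h>

// Sketch: measure the total nanoseconds a burst of n clock_gettime()
// calls takes for a given clock id. On the kernels discussed above,
// CLOCK_MONOTONIC stays in userspace via the vDSO, while
// CLOCK_MONOTONIC_RAW causes a real system call per invocation.
double burst(clockid_t id, int n)
{
    struct timespec ts, start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < n; ++i)
        clock_gettime(id, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) * 1e9 +
           (end.tv_nsec - start.tv_nsec);
}
```

Printing burst(CLOCK_MONOTONIC, 1000000) / 1e6 next to burst(CLOCK_MONOTONIC_RAW, 1000000) / 1e6 should show a marked difference in nanoseconds per call, and strace will only ever see the CLOCK_MONOTONIC_RAW variant.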

Now that I knew what the problem was, I found lots of other places noting issues with clock_gettime() performance, for example here and there, and other people have written some harsh words about the CLOCK_MONOTONIC_RAW timer that we attempted to use.

It is my hope that the next person to run into this will find this blogpost before spending half a day learning about vDSO. Good luck!

Thursday, May 19, 2016

Brief note on LuaWrapper and unexpected crashes in destructor & destructor ordering

PowerDNS products rely heavily on Lua, and we mostly use the most excellent LuaWrapper to seamlessly share data and code between Lua and C++. LuaWrapper is such a fundamental part of our products that we took over maintenance of LuaWrapper when the genius original author Pierre Krieger moved on to other things.

Yesterday we traced a weird crash when we fixed a memory leak in the PowerDNS Recursor, and I think I got lucky in finding the cause quite quickly. This could have taken days. My usual technique of searching the web for other people with similar crashes failed.

So this post is here to create something for your search engine to find. If you destroy the LuaWrapper object and afterwards see crashes in lua_pushlightuserdata(), what is going on is that any objects you copied from Lua are being destroyed *after* the LuaWrapper instance itself got destroyed.

This means that those objects are trying to deregister themselves with a Lua instance that is no longer there.

If you have a struct like this:

struct Somestuff
{
   typedef std::function<bool(std::shared_ptr<DNSQuestion>)> luacall_t;
   luacall_t d_preresolve;
   LuaContext d_lw;
};
And within your code you did:

d_preresolve = d_lw.readVariable<Somestuff::luacall_t>("preresolve");

You will get a crash because when your Somestuff instance gets destroyed, member destructors run in reverse order of declaration, from bottom to top. So d_lw is gone, and only THEN does d_preresolve get destroyed, at which point it tries to deregister with a Lua that is no longer there.

This is all trivially resolved by putting LuaContext at the very top of your object, and everything that depends on it below.

The C++ standard is explicit about the order in which destructors get called, so this is safe.
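The destruction order is easy to demonstrate with stand-in types (LuaContextStub, CallbackStub and g_log are made up for this sketch; g_log simply records destructor calls):

```cpp
#include <string>
#include <vector>

// Stand-ins that log their destruction, to show that C++ destroys
// members in reverse order of declaration.
std::vector<std::string> g_log;

struct LuaContextStub { ~LuaContextStub() { g_log.push_back("LuaContext"); } };
struct CallbackStub   { ~CallbackStub()   { g_log.push_back("callback"); } };

// Crash-prone layout: the Lua context is destroyed first, and only
// then does the callback try to deregister with it.
struct Wrong { CallbackStub d_preresolve; LuaContextStub d_lw; };

// Safe layout: the callback is destroyed first, while the Lua
// context is still alive.
struct Right { LuaContextStub d_lw; CallbackStub d_preresolve; };
```

Destroying a Right instance logs "callback" before "LuaContext"; destroying a Wrong instance logs them the other way around, which is exactly the crash pattern described above.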

I hope this was helpful!

Friday, October 23, 2015

How to do fast canonical ordering of domain names

A small post to document an arcane subject: how to quickly do a comparison of DNS names in canonical order. First, to recap, what is DNS canonical ordering? It is case insensitive, but 8 bit, based on the labels that make up the DNS name in reverse order.

So, in human order, powerdns.de, www.powerdns.com and www.powerdns.net sort like this:

 powerdns.de
 www.powerdns.com
 www.powerdns.net

But in DNS canonical order, they sort like this:

 www.powerdns.com
 powerdns.de
 www.powerdns.net

This is because in the canonical order, we look at the 'com', 'de' and 'net' parts first. And only if those are equal, we look at the second-to-last label in the name. RFC 4034 section 6 has all the details. DNS ordering is more than an obscure subject: you need to order your records this way to calculate DNSSEC signatures for example. If you get the ordering wrong, your signatures won't match.

So how can we do the comparison quickly? The naive way is of course to use one of your language primitives to split up a domain name in labels, reverse the order, and do a case insensitive lexicographical comparison on them. That could look like this in C++ 2011:
 auto ours = getRawLabels(), rhsLabels = rhs.getRawLabels();
 return std::lexicographical_compare(ours.rbegin(), ours.rend(),
                                     rhsLabels.rbegin(), rhsLabels.rend(),
                                     CIStringCompare());
While this is easy enough, it is also astoundingly slow since it splits up your domain name and does loads of allocations. Loading a 1.4 million record long zone into a container with canonical ordering this way took 40 seconds. Ordering based on naive case insensitive human compare loaded in 8 seconds.

Now, DNS names consist of labels with a length, so www.powerdns.com typically gets stored in a packet as the value 3, then "www", the value 8, then "powerdns", the value 3 and then "com". Note there are no dots in there!

It is highly recommended to also store DNS names internally as a series of length/label-content pairs, since otherwise you need to do lots of escaping to deal with embedded nulls, embedded dots, etc.

When stored like this however, it is not straightforward to do a canonical compare. So I asked around among our open source friends, and Marek Vavrusa of CZ.NIC quickly chimed in to explain how the excellent Knot nameserver products do it, and it is quite clever. First, you store the domain in reverse label order, so www.powerdns.com would turn into com.powerdns.www, which would normally look like "3com8powerdns3www" in memory.

However, if you naively compare 3com8powerdns3www with (say) 2de8powerdns3www, you'd decide, based on the '3' versus the '2', that www.powerdns.de would sort before www.powerdns.com, which is wrong.

So the clever bit is to zero out the label length fields, so you store the names as '0com0powerdns0www' and '0de0powerdns0www'. And then you can simply do a case-insensitive compare and get the right ordering. And of course there is no need to store the leading 0 in this case.
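As a sketch of this Knot-style transformation (canonicalKey is a made-up name, and this version keeps the leading 0 byte for simplicity), it could look like this:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: turn a wire-format name such as
// "\x03" "www" "\x08" "powerdns" "\x03" "com" into a reversed key
// with the length bytes zeroed out ("\0com\0powerdns\0www",
// lowercased), so that a plain lexicographic comparison yields
// canonical ordering.
std::string canonicalKey(const unsigned char* name, size_t len)
{
    // remember where each label starts
    size_t pos[128];
    size_t count = 0;
    for (size_t i = 0; i < len && name[i]; i += name[i] + 1)
        pos[count++] = i;

    // emit the labels back to front, replacing each length byte by 0
    std::string key;
    while (count--) {
        const unsigned char* label = name + pos[count];
        key.push_back('\0');
        for (size_t i = 1; i <= label[0]; ++i)
            key.push_back(tolower(label[i]));
    }
    return key;
}
```

With this, the key for www.powerdns.com compares smaller than the key for www.powerdns.de, as the canonical order requires.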

Now, there is a downside to this arrangement: you lose the information of what the domain actually looked like. If there were embedded 0s in the domain name, and there could be, you can't recover the domain name anymore. However, if you don't care, or if you just use this as a key and keep a copy of the original domain name somewhere, this works great. Thanks for the explanation Marek!

PowerDNS uses the most astoundingly great Boost Multi Index container. I had the great pleasure of meeting its author Joaquín Mª López Muñoz recently and I keep learning more about what is possible with this wonderful container. Boost Multi Index allows us to index objects based on what is in them, without a key that lives separately. So within PowerDNS we like to just use the DNSName that is embedded in an indexed object to sort, and we don't want to zero out the label lengths in there.

After some trial and mostly error, I hit on the following rapid ordering procedure for DNS names stored in DNS native format (so: 3www8powerdns3com).

  1. Scan through both names and note the positions of the label boundaries in a simple stack-based array (so no malloc). Store how many labels each DNS name has.
  2. Starting at the last position in your arrays, which denotes the beginning of the last label, do a lexicographical compare starting at one position beyond the length byte and ending "length byte" bytes after that. Do this for both DNS names.
  3. If this comparison leads to 'smaller', your DNS name is definitely smaller. If it leads to 'larger', your DNS name is definitely not smaller and you are done.
  4. Otherwise, proceed one place back in the arrays of label positions of both names.
  5. If you ended up at position 0 for one name and not yet for the other, that name is smaller and you are done.
  6. If you ended up at position 0 for BOTH DNS names, neither is smaller than the other.
  7. Go to 2 (except don't look at the last label, but at the 'current' position).
To make this safe, either make two arrays sufficiently large that no legal DNS name could overflow them, or use something plausible as a maximum, and fall back to allocating on the heap if your name is long enough to warrant it.
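A sketch of the seven steps above (canonCompare is a made-up name; a legal DNS name has at most 127 labels, so the fixed arrays here cannot overflow):

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>

// Sketch: 'less than' in canonical order for two DNS names stored in
// native wire format (length-prefixed labels, no dots), without any
// allocation.
bool canonCompare(const unsigned char* a, size_t alen,
                  const unsigned char* b, size_t blen)
{
    // step 1: note the label boundary positions in stack-based arrays
    size_t apos[128], bpos[128], acount = 0, bcount = 0;
    for (size_t i = 0; i < alen && a[i]; i += a[i] + 1)
        apos[acount++] = i;
    for (size_t i = 0; i < blen && b[i]; i += b[i] + 1)
        bpos[bcount++] = i;

    // steps 2-7: walk the labels back to front
    while (acount && bcount) {
        const unsigned char* la = a + apos[--acount];
        const unsigned char* lb = b + bpos[--bcount];
        size_t n = std::min<size_t>(la[0], lb[0]);
        for (size_t i = 1; i <= n; ++i) {   // case-insensitive compare
            int ca = tolower(la[i]), cb = tolower(lb[i]);
            if (ca != cb)
                return ca < cb;             // step 3: decided
        }
        if (la[0] != lb[0])                 // equal prefix: shorter label wins
            return la[0] < lb[0];
    }
    // steps 5 & 6: the name that ran out of labels first is smaller
    return bcount > 0;
}
```

So powerdns.com sorts before www.powerdns.com, and www.powerdns.com before www.powerdns.de, exactly as RFC 4034 prescribes.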

In code, this looks something like this. So, the big question of course, is it fast enough? After a little bit of tuning, the canonical comparison function implemented as above is just as fast as the 'naive' human order comparison. Loading a zone again takes 8 seconds. This is faster than you'd expect, but it turns out our canonical comparison function inlines better since the original version secretly used a C library function for case insensitive comparisons. 

I hope this has been helpful - either do what Knot does, if you can get away with it, and then it is super fast, or ponder our suggested stack-based array solution.

Good luck!

Monday, August 10, 2015

Startups don’t win RFPs: here’s why you might want to do one anyhow

(I’d like to thank Dirk Peeters who taught me most things RFP and Remco van Mook for commenting on and improving this post.)

As I find myself in RFP-land again, I have been pondering how my previous startups spent tremendous amounts of time working on these Requests For Proposals from huge customers. Enough so to warrant a blog post that may be helpful for current startups: how to choose between small customers and large customers, like governments and telcos with procurement departments.

Now, there are of course rare startups that sell straight to consumers, and their game is different. These typically are the startups everyone knows about, because they deal with customers (you) directly. But chances are your startup either sells to businesses, or will need to reach the consumer through established distributors or vendors that embed your stuff into their product. If you plan to sell to consumers directly, this post is not for you.

For context, most startups eventually have a ‘minimum viable product’ (MVP), or at least something that strives to be that. Product/market fit has not been achieved, let alone a perfect match. In other words, customers may be wanting things you have not gotten round to offering, or did not know they wanted. And meanwhile you added lots of stuff that they perceive as excess baggage. Not only is the product not perfect, neither is your knowledge of the market. There may not BE a market yet!

At this stage, there should be good contact with potential customers already. There are lots of small ones to talk to and far fewer very large ones. Who should you spend your time on?

Everyone discovers early that lofty revenue/profit goals will not be achieved with smaller customers in a reasonable timeframe. If you need thousands of business engagements to get to where you want to be, a startup-sized salesforce is not going to get you there in a reasonable timeframe. In fact, the salesforce that could make this happen does not want to work with you. Good salespeople work with companies with established products so they know they can make their quota (READ THIS LINK, by the way).

So when a potential ‘whale’ of a customer comes along, it is tempting to jump on that and give it all you got. And here is where I want to warn you. Large corporations and governments typically ‘tender’ deals. They don’t just pick a vendor, test their stuff, and make the deal. Instead, they write out a very confusing and conflicting list of requirements and instructions, and send that to any interested parties. Which may include you!

A typical RFP-process includes a spreadsheet filled with hundreds or thousands of numbered requirements, a set of documents outlining the procedure, and a number of questions like ‘outline security architecture of the product’, ‘provide copy of your sustainable sourcing policy’, and ‘describe in detail how the system deals with errors’. Next up might be a clarification meeting, where you can ask questions about the requirements and procedure. You then send in the huge stack of requested documents, after which you might be invited to present your company in person. This is then followed by interminable rounds of negotiations, references, proof of concept sessions etc.

Now, if your product is struggling to find a market (and at the beginning, it WILL be), this sure feels like traction! We’re getting somewhere, we have a potential customer, they have requirements, we can try to meet them, we have to show up for presentations etc. It almost feels like the real thing!

In my startups, I have wasted MONTHS on these processes. Turns out however, startups don’t win RFPs. Not until the word ‘startup’ starts to feel wrong for your (by now) serious company.


So why don’t you win an RFP as a small startup? For one, there is the kind of company that inflicts the RFP-process on itself. These are not dynamic places. These are not the organisations that want to give a startup a chance. That’s why they do an RFP: to make sure nothing is bought unless there is (credible) writing that the product will do what it promises. You mostly sell to the procurement department, not the actual user. And no matter how fab your product or service, the procurement department sees only risk in your startup. For one, they will try to check your financials for the past three years. You have not HAD three years!

The second reason you don’t win is that an RFP is a compendium of every requirement someone ever voiced in the company. Hundreds of them at least. And this strongly favours incumbent vendors who have had years or decades to add every such feature under the sun, if it makes sense or not. The deck is stacked against you.

The third reason you don’t win an RFP is that it is typically heavily lobbied by existing relationships, making sure that only one vendor qualifies, or that new challengers (you) are immediately disqualified from the process because you don’t have 100 staff, over 10 million in annual revenue or 5 years of profitable business behind you.

In addition, the RFP process is highly depressing:

  • Seeing a list of features you don’t have and won’t have anytime soon is painful
  • Many requirements are in fact nonsense (‘system MUST be redundant against power failures’ - customer is trying to procure software!) - which makes you wonder about the state of the world
  • Finding out you didn’t actually have a chance because you are a startup is a blow

But the siren song of the RFP is still tempting for the business to business startup since it sure feels like progress and traction! It may be hard to resist if no other actual sales are going on. So, here are some reasons why it might make sense to participate in an RFP anyhow:

  • You get a free list of competitor features! Most of them show up as requirements (see the lobbying above)
    • In general, the process is great ‘competitive intelligence’ - although this works both ways, your competition often learns about you too!
  • The whole process is very educational about how large customers think and operate, something most startup employees have little experience with
  • Attempting to meet the giant list of requirements is a great motivator for your development team, finally something concrete to aim for
  • The documents requested in an RFP might come in handy anyhow, like that ‘high-level overview of your architecture’. And with pressure, such documents get written a lot faster
  • It gives your salespeople something to do except moan about lack of traction, although the flip side of this is that they waste their time on the RFP and don’t get any actual business done
  • Frequently, a whole RFP process fails (very frequently by the way, much more than you’d think), and if you managed to make a great impression, you might get invited to the ‘afterparty’ and do business anyhow

But always manage the process carefully. Taking part in such a large process can swallow all the time and resources of a small startup, and in the end you might have little to show for it. Be sure to drop out in time when it isn’t working. It’s better to lose quickly than not to win slowly. And in any case don’t neglect the rest of your business as the process goes on! Also, do realize that even if you send in a compliant RFP response, it still only sits in the sales pipeline. It is not a purchase order.

Finally, there is the risk that you might actually win! And that is the point where all those ‘FULLY COMPLIANT’s you optimistically put in the spreadsheet come back to haunt you. You don’t get paid until you are actually fully compliant! That and the potential huge size of the deal that could well overwhelm your startup.

So getting back to the beginning of the post, the smaller customers that don’t fit with your lofty long term revenue goals. Well, they are your path to the market. For one, because they themselves are smaller, they don’t feel bad about doing business with small companies. In fact, when a small company tries to do business with a huge one, they feel they don’t get the attention they deserve.

Also, because no (formal) procurement department sits in between, if you find a small customer with strange requirements, you can talk to the people with the actual requirement and figure out what they mean, or convince them to drop it.

This does not mean your initial customer should necessarily be tiny. They might even be pretty large, as long as they are still procuring things ‘humanly’, and not by spreadsheet with macros that prevent you from entering explanations (not making this up). Your first goal is to get ANY revenue - it will help you sustain your business or help show (current and future) investors that you really are moving the needle.

Once you’ve established yourself through several approachable launching customers, you might start winning RFPs. And it still won’t be fun, but it will get you to your financial targets.

Good luck!

Wednesday, July 15, 2015

Developing open source: don't listen to the people that want you to live under a bridge

This post is for open source developers and all other people working on open source, while trying to make a living. You may also get referred to this page if you made an open source developer unhappy with your demands.

Let's start at the very beginning: authoring, documenting, packaging and supporting any software takes stupendous amounts of time. Quality requires serious, dedicated and sustained effort.

Secondly, we need to realise that people also need to eat, live somewhere, get health insurance and often eventually raise and support a family. It would also be great if they saved for their retirement.

Combining these two, doing a non-trivial open source project requires more than 'evening hours and weekend work'. It requires people dedicated to the task. But they also need to make money to live!

And eventually this collides with some folks' expectations of open source. It turns out you generally can't live on charitable donations, and I'm not even sure if you should - donations come and go, and they may also come with expectations that are contrary to those of your actual users. Most large open source projects will therefore need to make money the traditional way: by actually selling something.

There are loads of things you can sell. Consulting, support, new features (even open source, people will pay to get the features they need), training, training materials, value added services, perhaps even some non-free software on top of the stuff you give away. But no matter how you do it: if you ask money for things, some people who can't or won't pay get left out in the cold.

And this frequently leads to anger. People will accuse you of selling out, and this hurts. They may even mention you are stealing from the community. Oddly enough they will also threaten to stop using your software! And all of this because you try to make a living so you can provide this great open source stuff for free.

So here's my word of encouragement: there is a segment of the open source community that you will never appease. They won't be satisfied until you live under a bridge, sell your body by day so you can code by night. For free.

You won't ever make these people happy. Whatever you do, it will not be free enough, and you should always do more. Their threat to stop using your software should tell you everything you need to know about them. Finally also realise that more often than not, the very people that accuse you of selling out work for horrible companies that would not DREAM of committing anything back to the community!

So separate out these people that want you to live on the streets from the parts of the community you should be listening to. They will help you navigate the complex and challenging landscape of 'making money with open source'. But if you try to make the wrong people happy, you'll fail, and you and your software will end up badly.

(On a side note - open source is a community, not just business. There is no need to sell all or even the majority of your work. Some things are just a great idea, and you should add them to your software. Also ponder, users that need other stuff from you might "pay" you in Q&A, documentation work, (performance) testing etc. So don't get me wrong - this post is about making a living, not about asking money for everything!)

Summarising: making money with open source is ok, because delivering quality for a non-trivial project costs time, and that time can't come after dinner when you are tired from your day job. It should be a real job, and that requires income. Don't feel bad about it and don't try to make the wrong people happy. Do listen to the rest, as there is real tension between open source and making money, and they can guide you.

Good luck!

Saturday, June 6, 2015

Design of DelayPipe, a small class to let you execute millions of things in the near future

When developing, quite frequently we run into the situation where we want to delay an action a little bit, perhaps for half a second. If your software is not highly concurrent, this is as easy as a call to one of the more granular sleep() calls now available.

But what if you have a server facing 1000 DNS queries per second, and you want to delay answering all of them by a few hundred milliseconds? We can't spawn 1000 threads per second and just run nanosleep() on them!

This problem is all the more vexing because many end-users find it very easy to say the words 'just delay the answers a bit!', without realizing that this is in fact not an easy thing to do at high concurrency.

This week, we saw Pavel Odintsov run into this exact issue: nameservers from a big search engine were flooding him with questions, and Pavel was looking for solutions that would not involve actually blocking the search engine. We suggested adding a delay in answering queries, since this frequently shuts up 'back to back' generators of questions.

But then we actually had to do it. PowerDNS has a nice office these days, and it takes me 30 minutes of cycling to get there, and I find those 30 minutes are well suited to speccing out interesting solutions to problems.

What I came up with has now been implemented. It does have downsides, but it is remarkably simple and handles millions of delayed events per second with ease.

First, we employ a trick I documented back in 2007: using a pipe within the same process to transmit pointers. This gives you 'free' communication between threads, and since you pass pointers over the pipe, there is no need to serialize anything. We use the pipe to send a tuple containing the event and when it needs to be executed.

The nice thing too is that pipes guarantee that small writes are atomic. So we can have as many threads as we want put events in the pipe, without any locking on our side, and they will arrive at the worker thread uninterleaved. And no locking (in our code at least) is always a win!

As an aside, we've been using the pipe trick since 2007 or so, and it has never shown up in any benchmarks (except in a positive fashion). However, people frequently associate pipes with "slow text based command line monstrosities", and assume pipes must be slow. In fact, pipes are core operating system infrastructure, and they are blazing fast.

Getting back to our 'DelayPipe', the worker thread reads pointers from the pipe, and puts them in a time ordered container. Subsequently, there is the question of how to 'wait until the first event that needs to happen'. And again, we can reuse some POSIX semantics, as follows.

We don't immediately read from the pipe, but we call poll() on it first with a timeout, and that timeout is equal to the amount of time we must wait until the first event needs to be executed. And if there is no work already queued, we wait infinitely long. For safety, we also check if there is actual work in the queue that is overdue for happening and in that case we don't read from the pipe at all. Once poll() either tells us we have a new event, or a timeout happened, we execute all events that have reached their appointed moment.

With this simple setup, we get a solution that is thread-safe, since many threads can put things in the pipe simultaneously. We do have only one executor of delayed events, which might be unacceptable if the execution of events is actually slow. However, we could easily spawn more of these threads, and give them their own pipe.
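A minimal single-consumer sketch of the idea (class and member names are invented here, and error handling is rudimentary; the real PowerDNS DelayPipe differs in the details):

```cpp
#include <poll.h>
#include <unistd.h>
#include <chrono>
#include <functional>
#include <map>
#include <stdexcept>

// Sketch: producers write a pointer to a heap-allocated event into a
// pipe (small writes are atomic, so no locking is needed); a single
// worker thread poll()s the read end with a timeout equal to the time
// until the earliest queued event is due.
class DelayPipeSketch {
    int d_fds[2];
    std::multimap<std::chrono::steady_clock::time_point,
                  std::function<void()>> d_queue;  // time-ordered container

    struct Event {
        std::chrono::steady_clock::time_point when;
        std::function<void()> what;
    };
public:
    DelayPipeSketch() {
        if (pipe(d_fds) < 0)
            throw std::runtime_error("pipe() failed");
    }
    ~DelayPipeSketch() { close(d_fds[0]); close(d_fds[1]); }

    // Called from any thread: a pointer-sized write is atomic.
    void submit(std::function<void()> what, int msec) {
        Event* ev = new Event{std::chrono::steady_clock::now() +
                              std::chrono::milliseconds(msec), std::move(what)};
        if (write(d_fds[1], &ev, sizeof(ev)) != (ssize_t)sizeof(ev))
            throw std::runtime_error("write() failed");
    }

    // Called from the worker thread in a loop.
    void runOnce() {
        int timeout = -1;  // wait forever if nothing is queued
        if (!d_queue.empty()) {
            auto left = std::chrono::duration_cast<std::chrono::milliseconds>(
                d_queue.begin()->first - std::chrono::steady_clock::now()).count();
            timeout = left < 0 ? 0 : (int)left;
        }
        // If work is already overdue (timeout == 0), skip the pipe entirely.
        struct pollfd pfd{d_fds[0], POLLIN, 0};
        if (timeout != 0 && poll(&pfd, 1, timeout) > 0) {
            Event* ev;
            if (read(d_fds[0], &ev, sizeof(ev)) == (ssize_t)sizeof(ev)) {
                d_queue.emplace(ev->when, std::move(ev->what));
                delete ev;
            }
        }
        // Execute everything that has reached its appointed moment.
        while (!d_queue.empty() &&
               d_queue.begin()->first <= std::chrono::steady_clock::now()) {
            d_queue.begin()->second();
            d_queue.erase(d_queue.begin());
        }
    }
};
```

Many producer threads can call submit() concurrently while one thread loops on runOnce(); spawning more workers, each with its own pipe, would scale out slow event execution as described above.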

A potential downside is that this setup malloc()s stuff in a producer thread and free()s them in the worker thread, something that is known to be a heavily locked and potentially slow operation in many mallocs. Benchmarking on recent platforms however has not shown any appreciable overhead because of this.

Finally, after only a 30 minute trip on my bike, and half a day of coding, we were able to get this up and running.

The actual code can be found here and here. The testcase shows how it works in practice. I hope this has been useful for you!

PS: before you cry "you don't know how deep the pipe is!", please do realize we don't actually use the pipe for queueing; we have a separate container for that. The pipe is only a nice way to get easy producer/consumer communications between threads with sane semantics.

Thursday, March 5, 2015

Some notes on shared_ptr atomicity and sharing configuration state

At PowerDNS, we've frequently run into this problem: a program has a complicated amount of state and configuration which determines how queries are processed, which happens non-stop. Meanwhile, occasionally we need to change this configuration, while everything is running.

The naive solution to this problem is to have a state which we access under a read/write lock. The state can in that case only be changed if no thread holds a read lock on it. This has at least two downsides. For one, locks aren't free. Even if they don't involve system calls, atomic operations cause inter-CPU communications and cache evictions. Secondly, if the worker threads hog the read lock (which they may need to do for consistency purposes), we can't guarantee that updates happen in a reasonable timeframe.

Effectively this means that a change in configuration might take a very long time, while we incur overhead every time we access the configuration, even if it isn't changing.

A very very tempting solution is to keep the configuration in a shared_ptr, and have threads access the configuration through this shared_ptr. This would give us unlocked access to a consistent configuration. And, if we read the C++ 2011 standard, it looks like this could work. It talks about how std::shared_ptr is thread safe under various scenarios. Simultaneously, the standard defines atomic update functions for shared_ptr (atomic_load, atomic_store and friends), which are sadly unimplemented in many modern compilers. This is a hint.

So here is what one would hope would work:
if(!g_config->acl.check(clientIP)) dropPacket();
Where the global g_config would be something like shared_ptr<Config>. If the user updates the ACL, we would do this to propagate it:
auto newConfig = make_shared<Config>(*g_config);
newConfig->acl = newACL;
g_config = newConfig;
And we would fervently hope that the last statement was atomic in nature, so that a user of g_config either gets the old copy, or the new copy, but never anything else. And this would be right in at least 999999 out of 1 million cases. But in that other case, we crash. I know because I wrote a testcase for this this afternoon.

It turns out that internally, a shared_ptr consists of reference counts and a pointer to the actual object. And sadly, when we assign to a shared_ptr, the reference counts and the object pointer get assigned separately, sequentially. A user of g_config above might thus end up with a shared_ptr in an inconsistent state that way.

By tweaking things a little bit, for example by utilizing swap(), you can increase the success rate of this mode of coding to the point where it fails almost never. You could fool yourself into thinking you solved the problem. Over at PowerDNS we thought that too, but then suddenly CPUs and compilers change, and it starts breaking again, leading to hard to debug crashes.

So, to summarise, whatever the C++ 2011 standard may or may not say about shared_ptr, as it stands in 2015, you can't atomically change a shared_ptr instance while someone tries to use it.

And of course we could add an RW-lock to our every use of g_config, but that would get us back to where we started, with heavy locking on everything we do.

Now, in general this problem (infrequent updates, non-stop access) is very well known, as is the solution: Read Copy Update. I'm not a big fan of software patents (to say the least), but I'll lovingly make an exception for RCU. IBM released the patent for use in GPL-licensed software, and unlike most patents, this one doesn't only prohibit other people from doing things, RCU also tells you exactly how to do it well. And RCU is sufficiently non-obvious that you actually need that help to do it well.

Now, the full glory of RCU may be a bit much, but it turns out we can very easily get most of its benefits:

  • Lock the g_config shared_ptr before changing it (this can be a simple mutex, not even an RW one, although that helps).
  • Have the threads make a copy of this g_config ten times per second, fully locked.
  • Have the threads actually only access this (private) copy.
This means that if the configuration is changed, the operational threads will continue with the old configuration for at most 0.1 second. It also means that no matter how staggering the overhead of a lock is, we incur it only ten times per second. Furthermore, since the lock is only held very briefly for a copy, the updates will also happen very quickly.

In this way, we don't rely on unimplemented atomic shared_ptr functions, but we do get all the benefits of almost completely unlocked operations. 
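In code, the arrangement might look something like this sketch (Config, g_config and the 100 ms refresh interval are illustrative; the thread_local copy stands in for each worker thread's private copy):

```cpp
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>

// Sketch of the pattern: the global configuration lives behind a
// plain mutex, and each worker thread refreshes a private copy of
// the shared_ptr at most ten times per second. All other accesses
// involve no locking at all.
struct Config { int version = 0; /* ACLs, settings, ... */ };

std::shared_ptr<Config> g_config = std::make_shared<Config>();
std::mutex g_configLock;

void updateConfig(std::shared_ptr<Config> newConfig) {
    std::lock_guard<std::mutex> lock(g_configLock);
    g_config = std::move(newConfig);  // safe: nobody reads g_config unlocked
}

std::shared_ptr<Config> getLocalConfig() {
    using namespace std::chrono;
    thread_local std::shared_ptr<Config> local;
    thread_local steady_clock::time_point lastRefresh;
    auto now = steady_clock::now();
    if (!local || now - lastRefresh > milliseconds(100)) {
        std::lock_guard<std::mutex> lock(g_configLock);
        local = g_config;  // copy the shared_ptr under the lock
        lastRefresh = now;
    }
    return local;
}
```

To change the ACL, build a new Config from the old one and call updateConfig(); worker threads pick up the change within 0.1 second, while paying for the mutex at most ten times per second.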

UPDATE: Many people have pointed out that instead of copying "ten times per second", you could refresh the copy whenever an atomic "generational" global counter no longer matches the local one. But some potential synchronisation issues linger in that case (you might miss a second very rapid change, for example). So while interesting, we do lose some simplicity that way.

UPDATE: the code for this idea can be found here.

Summarising: don't attempt to rely on potential shared_ptr atomic update behaviour. Instead, copy it infrequently - frequently enough that changes in configuration propagate swiftly, but not so frequently that the locking overhead matters.

Enjoy! And if you know about the implementation plans and status of the atomic_load etc family of functions for shared_ptr in the various popular compilers, please let me know!

UPDATE: Maik Zumstrull found this thread about the atomic shared_ptr operations in gcc.