The Armijn Hemel portrait is a derivative of “armijn (Armijn Hemel): Finding prior art to bust bad quality patents” by coscup, used under CC BY.
I have been working on open source license compliance (mostly for embedded Linux) for almost 16 years and have looked at hundreds of products and companies. Some companies I talked to realised that they had an open source license compliance problem, went out and bought a scanner tool, and scanned their code with it. They then found out that running the scanner did not solve their problem and actually made it worse: they were overwhelmed by the amount of information the tools were giving them and didn’t know where to start. What went wrong?
Tools are incredibly helpful, but they can be counterproductive if you don’t know what problem you are actually trying to solve. I often ask my clients what problem they are trying to solve, and I hear something like “we need to know all the licenses of the software that we use and we need a tool for that”. I could of course help them with just this task and be done with it, but I always try to dig deeper, as in almost all cases it turns out that there is another issue that actually needs to be solved, for example “we are shipping a product and we got a letter from a copyright holder claiming we do not follow the license conditions” or “our customer has asked us to disclose what software is used and wants the complete and corresponding source code according to the license conditions”. While scanning source code is related to these problems, it is definitely not all there is to compliance.
What I found is that the core question people want answered is “where did this code come from?” (provenance detection). Everything else (“what license is this code under?”, “who wrote the code?”, et cetera) follows from answering this question. It is not always immediately clear where code comes from. For example, if a product is built on an open source framework such as the Yocto Project, then the code usually comes directly from dozens of open source projects. If a Linux-based product is built using an SDK (software development kit) procured from a chipset manufacturer, then it is very likely a mix of code from different origins: open source projects, open source or proprietary code from ISVs, code from the chipset manufacturer itself, and so on. Depending on the chipset manufacturer it might be very clear, a complete mess, or anything in between.
To find the provenance of code it is absolutely necessary to use tools, because doing the analysis by hand very quickly becomes infeasible. A few hundred files can be done by hand, but a package the size of the Linux kernel would take a very long time: if you spent just a single second on each file in the Linux kernel, it would take you more than 12 hours to go through all of them! There are various strategies you could use. For example, you could match checksums against known (open source) files to see if an entire file is already known, or use a snippet scanner (for snippets of code), or a combination of both (like what FossID does), and then clear the results.
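To make the checksum-matching idea concrete, here is a minimal sketch in Python. The hash database here is a hypothetical stand-in: real scanners query large databases of checksums of known open source files, but the mechanism is the same.

```python
import hashlib


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files don't have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


# Hypothetical database mapping known checksums to (description, origin).
# A real tool would query millions of entries collected from upstream releases.
KNOWN_FILES = {
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855":
        ("empty file", "n/a"),
}


def identify(path):
    """Look up a file by checksum; None means 'unknown, needs manual review'."""
    return KNOWN_FILES.get(file_sha256(path))
```

Note that a checksum match only works for files that are byte-for-byte identical to a known upstream file; anything modified, even by a single character, falls through to manual review or snippet scanning.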
If you have ever done this type of work you know that clearing lots of files is not the most exciting thing to do. If you are like me, I am sure you would rather do something else instead. My advice: don’t work hard, scan smart!
One approach that I have found to be very effective for my clients is “delta scanning”: first (or only) looking at the code that differs from already known (open source) code. This allows me to quickly zoom in on problems because it filters out the noise, and typically there is a lot of noise! To give an example: if I had a product based on Linux, I could of course scan the entire Linux kernel source code to see where it comes from. But I already know what the answer will be: “you use the Linux kernel”. So instead of verifying and confirming every result, I discard the files that I already know are in the upstream Linux kernel. Typically this reduces the problem space from over 60,000 files to about 150 to 200 files, which allows me to dive a lot deeper into the origins and see if anything strange is happening. Since well known open source components such as the Linux kernel have been looked at by thousands of people (and problems are caught very quickly in that process), I am confident in this approach.
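A delta scan like the one described above can be sketched in a few lines of Python. The directory names are placeholders; in a real audit you would compare the vendor tree against the exact upstream release it claims to be based on. Comparing by content hash rather than by path also catches files that a vendor has moved or renamed without changing them.

```python
import hashlib
import os


def tree_hashes(root):
    """Map relative path -> SHA-256 digest for every file under root."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                hashes[rel] = hashlib.sha256(f.read()).hexdigest()
    return hashes


def delta_scan(vendor_root, upstream_root):
    """Return vendor files that are absent from or modified versus upstream.

    Files whose contents are byte-for-byte identical to an upstream file are
    discarded; only the remainder needs a closer look. This is the filtering
    step that shrinks a kernel tree from tens of thousands of files to a few
    hundred.
    """
    upstream_digests = set(tree_hashes(upstream_root).values())
    return sorted(rel for rel, digest in tree_hashes(vendor_root).items()
                  if digest not in upstream_digests)
```

Everything this function returns is exactly the set of files worth a human’s time: vendor additions, patched upstream files, and anything else that cannot be explained by the upstream release.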
I know that this approach is very effective, as it is what I use for the OSADL License Compliance Audit (https://www.osadl.org/License-Compliance-Audit.osadl-services-lca.0.html), where I have to do a (scoped) audit in a single day. The only problems we have ever found during these audits (such as wrong licenses: GPLv3, non-commercial use only, or Apache 2) were in files that were not part of the upstream Linux kernel. Before you start implementing this approach (which I would recommend), I want to stress that I have only talked about provenance detection, not about the other things you might need to do for license compliance, such as extracting copyright statements and license texts. But this approach will save you a lot of time, which you can then spend on those tasks as well!
So to wrap up: Scanning in a smart way significantly reduces the amount of work you have to do, and it allows you to spend that time on the really problematic code and to be much more thorough.
With this post I would like to raise awareness about the NLnet Foundation (https://nlnet.nl) and their work in supporting organizations and people that contribute to an open information society. Please visit their homepage to learn more!