Analyzing Every Clojure Project on Github

Posted: September 6th, 2022
Update: September 7th, 2022. Added Failed Analyses section and fixed total repository count.

See source for full history.

Introduction

I'm generally interested in tools like cljdoc that work at the ecosystem level. As part of my work on dewey, which builds an index of all clojure libraries on github, I thought it would be a straightforward extension to statically analyze all the projects found. A bare, shallow checkout of every clojure project found by dewey [1] is only about 14GB, which is a very tractable size.

There are also a couple of features of the clojure ecosystem that make it an interesting target for wholesale analysis:

  • libraries address a wide range of problems

  • libraries tend to be stable and concise

  • clojure syntax is regular and minimal

  • clojure has strong support and conventions for namespaces

There are several tools available for analysis. Projects like cljdoc and getclojure use dynamic analysis, but for this initial run, I chose to use clj-kondo's static analysis. Given a project, clj-kondo will report [2]:

  • namespace definitions and usages

  • locals and usages

  • var definitions and usages

  • keywords

  • protocol implementations

Analysis Rate: 87%

Out of the 13,274 repositories found by dewey on github, 11,573 (87%) were successfully [3] analyzed.

Results

The main goal of this project is to make the resulting analyses available for other projects and tools to consume. The full analysis of projects can be found under the dewey releases in the analysis.edn.gz file.
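For anyone who wants to poke at the data, here is a minimal sketch of loading a gzipped edn file. It assumes the file contains one or more top-level edn forms that the stock edn reader accepts. The function name read-gzipped-edn is made up for illustration, and the real file is large enough that reading it all into memory may not be practical.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '(java.io PushbackReader)
        '(java.util.zip GZIPInputStream))

(defn read-gzipped-edn
  "Reads every top-level edn form from the gzipped file at `path`
  into a vector. Hypothetical helper, not part of dewey."
  [path]
  (with-open [rdr (-> (io/input-stream path)
                      (GZIPInputStream.)
                      (io/reader)
                      (PushbackReader.))]
    (loop [forms []]
      ;; edn/read returns the :eof value once the stream is exhausted
      (let [form (edn/read {:eof ::eof} rdr)]
        (if (= ::eof form)
          forms
          (recur (conj forms form)))))))
```

Usage would look like (read-gzipped-edn "analysis.edn.gz"), with the path pointing at a downloaded copy of the release file.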

In addition to providing the data, I also wanted to do some cursory investigation based on the analyses which I present below.

Hidden Gems

The clojure.core library is rich and I always seem to find useful nuggets that I've somehow missed. Just to make sure that I'm not the only person with clojure.core FOMO, I thought I'd share some hidden gems found among the least used public clojure.core vars. To be fair, many of these functions are recent additions, but not all of them!
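As a rough sketch of how "least used" can be computed, the snippet below counts clojure.core usages in a few hand-written maps shaped like clj-kondo's :var-usages entries (the :to and :name keys). The sample data and function names are invented for illustration, and this is not necessarily how the numbers in this post were produced.

```clojure
;; Hand-written sample shaped like clj-kondo :var-usages entries.
(def sample-var-usages
  [{:to 'clojure.core   :name 'map}
   {:to 'clojure.core   :name 'map}
   {:to 'clojure.core   :name 'map}
   {:to 'clojure.core   :name 'pvalues}
   {:to 'clojure.string :name 'join}
   {:to 'clojure.core   :name 'min-key}
   {:to 'clojure.core   :name 'min-key}])

(defn core-usage-counts
  "Map of clojure.core var name to number of usages."
  [var-usages]
  (->> var-usages
       (filter #(= 'clojure.core (:to %)))
       (map :name)
       frequencies))

(defn least-used
  "The n clojure.core vars with the fewest usages, as [name count] pairs."
  [n var-usages]
  (->> (core-usage-counts var-usages)
       (sort-by val)
       (take n)))

(least-used 2 sample-var-usages)
;; => ([pvalues 1] [min-key 2])
```

Note that a var with zero usages never shows up in a count like this, so it only ranks vars that were used at least once.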

pvalues and pcalls

I'll often use pmap for quick and dirty parallel computation. pvalues (15 usages) and pcalls (19 usages) also seem useful for quick and dirty parallelization.

(pvalues & exprs): Returns a lazy sequence of the values of the exprs, which are evaluated in parallel.
(pcalls & fns): Executes the no-arg fns in parallel, returning a lazy sequence of their values.
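Here's a small illustration of the difference between the two. slow-square is a made-up stand-in for an expensive computation.

```clojure
(defn slow-square
  "Stand-in for an expensive computation (hypothetical helper)."
  [n]
  (Thread/sleep 100)
  (* n n))

;; pvalues takes expressions, which are evaluated in parallel.
(def squares-a (doall (pvalues (slow-square 2) (slow-square 3))))

;; pcalls takes zero-argument functions instead.
(def squares-b (doall (pcalls #(slow-square 4) #(slow-square 5))))

squares-a ;; => (4 9)
squares-b ;; => (16 25)
```

Both are built on pmap, which uses futures, so a standalone script may want to call (shutdown-agents) at the end to exit promptly.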

parse-*

The core library added a handful of useful parse-* functions in 1.11: parse-double (26 usages), parse-boolean (6 usages), parse-long (202 usages), and parse-uuid (25 usages). They do basically what you would expect. I usually search for the proper java interop call or go the lazy route and just use read-string. I'm happy to have a better alternative now.
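A quick tour, assuming Clojure 1.11 or later. The part that is easy to miss is that unparseable strings return nil rather than throwing.

```clojure
;; Requires Clojure 1.11 or later.
(parse-long "42")          ;; => 42
(parse-double "3.14")      ;; => 3.14
(parse-boolean "true")     ;; => true
(parse-uuid "b6883c0a-0342-4007-9966-bc2dfa6b109e") ;; => a java.util.UUID

;; Strings that don't parse return nil instead of throwing.
(parse-long "forty-two")   ;; => nil
(parse-long "4.2")         ;; => nil
(parse-boolean "yes")      ;; => nil
```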

iteration

Another gem is iteration (11 usages, once by dewey itself!) which was added in 1.11. The doc string doesn't immediately illuminate why or how to use it, but it's great. Check it out! [4]
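The typical use case is walking a paginated API. Here's a minimal sketch where pages and fetch-page are made-up stand-ins for a real HTTP call. Each response carries its items plus a token for the next page, and the last page has no token.

```clojure
;; Fake paginated API: hypothetical data for illustration only.
(def pages
  {0 {:items [1 2 3] :next 1}
   1 {:items [4 5]   :next 2}
   2 {:items [6]}})

(defn fetch-page
  "Stand-in for an HTTP request that takes a page token."
  [token]
  (get pages token))

(def all-items
  (into []
        cat
        (iteration fetch-page
                   :initk 0        ; token for the first call
                   :vf    :items   ; what to keep from each response
                   :kf    :next))) ; where to find the next token

all-items ;; => [1 2 3 4 5 6]
```

The iteration stops when :kf returns nil, which here happens on the last page.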

min-key and max-key

Two functions that I think are criminally underused are min-key (256 usages) and max-key (431 usages). They tend to get overshadowed by their much more popular cousin, sort-by (4,926 usages), but are still useful in their own right.

(min-key k x)
(min-key k x y)
(min-key k x y & more)
Returns the x for which (k x), a number, is least. 
If there are multiple such xs, the last one is returned.

max-key is the same, but different.
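A small example with made-up data. Since both functions take their candidates as separate arguments, apply is the usual way to use them with a collection.

```clojure
;; Sample data for illustration.
(def langs
  [{:name "clojure" :year 2007}
   {:name "lisp"    :year 1958}
   {:name "java"    :year 1995}])

(:name (apply min-key :year langs)) ;; => "lisp"
(:name (apply max-key :year langs)) ;; => "clojure"

;; The sort-by version gives the same answer, but sorts the whole
;; collection just to take one element.
(:name (first (sort-by :year langs))) ;; => "lisp"
```

One caveat: applying either function to an empty collection throws, since they need at least one candidate.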

random-uuid

random-uuid (60 usages) is another 1.11 addition and a welcome one.

Usage of clojure.core functions added in 1.10 and 1.11

There's always a delay between new clojure versions and projects getting around to adopting and leveraging the latest additions. I was curious which of the latest features were gaining traction.

* seq-to-map-for-destructuring [5]

Reference type usage

The results match my expectation that atom is, by far, the most common reference type. It was interesting to see that ref usage is higher than volatile!, but volatile! is much newer. I'm not surprised that agent is the least used reference type.
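For readers who haven't used all four, here is a minimal side-by-side of the reference types being counted. The var names are arbitrary.

```clojure
;; atom: uncoordinated, synchronous updates
(def counter (atom 0))
(swap! counter inc)

;; ref: coordinated updates inside a transaction
(def account (ref 100))
(dosync (alter account - 30))

;; volatile!: like an atom without the atomicity guarantees
(def scratch (volatile! []))
(vswap! scratch conj :x)

;; agent: asynchronous updates
(def log (agent []))
(send log conj :started)
(await log) ; block until the sent action has run

[@counter @account @scratch @log]
;; => [1 70 [:x] [:started]]
```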

Average mutable reference usage per repository: 1.94
Repositories with no mutable reference usages: 7,245 (63%)

Truly bananas. Clojure libraries really do have less state. It probably isn't surprising if you've been programming in clojure for any length of time, but it's pretty wild to see the data back this up. I don't think I would have believed this 10 years ago.

Var Definitions and Macros

total var definitions: 1,228,404

Most var definitions are functions. No surprise. I am sort of surprised that over 20% of vars are defs. What are people defing all over the place? That's something I would like to follow up on at some point.

I consider defprotocol to be an important, fundamental building block, so to see that defmacro usage (2.93%) is similar to defprotocol usage (3.84%) is interesting.
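A sketch of how a breakdown like this can be computed, assuming entries shaped like clj-kondo's :var-definitions with a :defined-by key. The sample data and function name are made up for illustration.

```clojure
;; Hand-written sample shaped like clj-kondo :var-definitions entries.
(def sample-var-definitions
  [{:name 'handler    :defined-by 'clojure.core/defn}
   {:name 'parse      :defined-by 'clojure.core/defn}
   {:name 'default-db :defined-by 'clojure.core/def}
   {:name 'with-conn  :defined-by 'clojure.core/defmacro}])

(defn defined-by-percentages
  "Map of defining form to its share of all var definitions, in percent."
  [var-definitions]
  (let [total (count var-definitions)]
    (into {}
          (map (fn [[form n]] [form (* 100.0 (/ n total))]))
          (frequencies (map :defined-by var-definitions)))))

(defined-by-percentages sample-var-definitions)
;; => {clojure.core/defn 50.0, clojure.core/def 25.0, clojure.core/defmacro 25.0}
```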

Lein vs deps

These numbers don't directly match the latest State of Clojure survey numbers (i.e., over 50% deps.edn usage), but I don't think they're inconsistent, since clojure repositories on github will be weighted towards older projects.

The "neither" category means that no deps.edn or project.clj file was found. Many of them were shadow-cljs projects. Detecting other build tools would be a good improvement in the future.
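The classification itself is simple. Here's a sketch that works on the set of file names at the top level of a repository. The function name and the :both category are my assumptions for illustration.

```clojure
(defn build-tool
  "Classifies a repository by the project files found at its top level.
  `file-names` is a set of strings. Hypothetical helper."
  [file-names]
  (let [deps? (contains? file-names "deps.edn")
        lein? (contains? file-names "project.clj")]
    (cond
      (and deps? lein?) :both
      deps?             :deps
      lein?             :lein
      :else             :neither)))

(build-tool #{"README.md" "deps.edn"})         ;; => :deps
(build-tool #{"project.clj" "deps.edn"})       ;; => :both
(build-tool #{"shadow-cljs.edn" "package.json"}) ;; => :neither
```

Detecting other build tools would mean adding more file names to check, such as shadow-cljs.edn in the last example.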

Local names

Clojurians like using short local variable names and this proves it!

Local name lengths

The most common local name is _ with 155,951 usages, followed by this with 92,296. I'm not sure what it says that 4% of locals are throwaways. What a waste.
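A sketch of the kind of counting involved, assuming entries shaped like clj-kondo's :locals with a symbol under :name. The sample data and function names are invented.

```clojure
;; Hand-written sample shaped like clj-kondo :locals entries.
(def sample-locals
  [{:name '_} {:name 'this} {:name '_} {:name 'x}
   {:name 'acc} {:name 'request} {:name 'e} {:name 'm}])

(defn name-length-histogram
  "Sorted map of local name length to number of locals with that length."
  [locals]
  (into (sorted-map)
        (frequencies (map (comp count name :name) locals))))

(defn throwaway-share
  "Fraction of locals that are named _."
  [locals]
  (/ (count (filter #(= '_ (:name %)) locals))
     (double (count locals))))

(name-length-histogram sample-locals) ;; => {1 5, 3 1, 4 1, 7 1}
(throwaway-share sample-locals)       ;; => 0.25
```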

Local binding names that are a single letter

I couldn't find an official reference for what single character symbols are supported by clojure, but the edn spec states that:

Symbols begin with a non-numeric character and can contain alphanumeric characters and . * + ! - _ ? $ % & = < >

I like the idea of using Greek letters as locals, but I find using | or ! as locals offensive.

Additionally, : and # are allowed as constituent characters in symbols other than as the first character.

Does anybody do that? Well, there's only a single local that contains # in the name other than at the very end (which is used by syntax quote), but 21 repositories use locals with : in the name. It's not something I've thought a lot about before, but now that I know that I can...
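For the curious, all of these are accepted as local names by the Clojure reader on the JVM. This is only a demonstration that it works, not an endorsement.

```clojure
;; A local with : in the middle, a Greek letter, and a lone !
(def odd-sum
  (let [a:b 1
        λ   2
        !   3]
    (+ a:b λ !)))

odd-sum ;; => 6
```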

Emoji and unicode in names

It's not clear if using emoji and unicode as var names is officially supported, but it's also not clear if it's not not supported [6]. I mean, it works when you type it into the repl:

user> (def 😊 42)
#'user/😊
user> (+ 😊 😊)
84

I'm not saying it's a good idea. Anyway, there are 25 brave repositories that use emoji/unicode in var definition names. There is a single namespace that includes unicode.

Usages range from questionable to ╯°□°╯ ┻━┻.

Failed Analyses

For the 13% of repositories that aren't included, what went wrong?

In the future, I'd like to do a better job of breaking down what errors happened during analysis, but here's the coarse breakdown for now.

Descriptions:

  • :empty-analyses: No analysis was produced. The most common (only?) reason is that no project files were found (only deps.edn and project.clj supported).

  • :no-findings: Analysis succeeded without error, but no findings were found. In theory, this could mean that everything worked and there was just nothing to analyze, but I'm assuming that it's an error for now.

  • various Exceptions: Exceptions thrown while trying to analyze a project.

  • :max-bytes-limit-exceeded: Repository checkout was over the size limit (currently 100MB).

  • :download-error: An exception was thrown while trying to download the repository from github.

  • :read-error: Analyses for each repository were saved to intermediate files before being aggregated. This error means the resulting analysis file was unreadable.

Biases

I guess I wouldn't say the data is biased so much as unrefined. I tried simply looking at the "most popular X" among usages, but you get misleading results if you don't account for biases like:

  • popular libraries have more forks which amplify their usage patterns

  • outliers: a single or small number of repositories that do "weird" things can skew simple counts and averages

  • copy and pasting: for a number of reasons, shared code is sometimes added to a repository by copying and pasting the code into a library rather than referencing the code as a dependency.

  • example and test code: test/example code can often exhibit very different usage patterns than "normal" code.

  • failed analyses: about 87% of repositories were successfully analyzed. That's not terrible, but the 13% of repositories that failed to produce an analysis probably aren't uncorrelated, which can introduce biases.

Most of these biases are pretty straightforward to account for and will hopefully be addressed in future work.
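As one example of a possible correction (not something the current analysis does), counting how many repositories use a var at least once, instead of counting every usage, dampens the effect of forks and outliers. The :repo key and sample data below are assumptions for illustration.

```clojure
;; Hypothetical usages tagged with the repository they came from.
(def sample-usages
  [{:repo "a/lib"  :name 'pmap}
   {:repo "a/lib"  :name 'pmap}
   {:repo "a/lib"  :name 'pmap}
   {:repo "b/app"  :name 'pmap}
   {:repo "b/app"  :name 'mapv}
   {:repo "c/tool" :name 'mapv}])

(defn raw-counts
  "Total usages per var. One heavy user can dominate."
  [usages]
  (frequencies (map :name usages)))

(defn repo-counts
  "Number of distinct repositories using each var at least once."
  [usages]
  (->> usages
       (map (juxt :repo :name))
       distinct
       (map second)
       frequencies))

(raw-counts sample-usages)  ;; => {pmap 4, mapv 2}
(repo-counts sample-usages) ;; => {pmap 2, mapv 2}
```

Handling forks properly would also need a way to group a fork with its upstream repository, which this sketch leaves out.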

Future Work

Most language ecosystems have figured out that directly supporting community repositories/tools like clojars is a big win. Based on some smart design choices by clojure, its maintainers, and the community, I think that clojure is well positioned to take advantage of tools that work at the ecosystem level. I'm not totally sure what other ecosystem tools look like, but I have some ideas.

Mundane Improvements

While the vision is to work on innovative ecosystem tooling, many of the planned improvements are mundane:

  • crawl other open-source hosts besides github

  • improve support for clojurescript and other clojure dialects

  • automate more of the data gathering and analysis

Use a database

Currently, the full analysis of every clojure repository is made available as a giant 600+ MB gzipped edn file. It would be great to make the data available in a more organized, structured, and queryable format (like a database).

Example Usages

Examples for clojure functions can be illustrative. It should be possible to pull up examples for any function.

Github search is bad. grep.app is better, but a search tool that only targets clojure code could be much, much better!

Java/Javascript Interop

One of the common roadblocks for learning clojure/clojurescript is that you eventually need some amount of host interop to be productive. Finding commonly used java/javascript host interop can provide valuable input for improving learning resources with common interop knowledge or creating wrapper libraries for frequently required host APIs.

Clojure Specs

Afaict, specs are designed with a global, universal orientation in mind. We should build tooling that leverages and supports this. As a simple example, building a searchable database of all specs should be straightforward.

N-grams

In the fields of computational linguistics and probability, an n-gram (sometimes also called Q-gram) is a contiguous sequence of n items from a given sample of text or speech.

I'm not exactly sure what I'd expect to find, but it would be interesting to find common patterns and idioms in clojure code.
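The mechanics are easy with partition. Here's a sketch over a made-up sequence of function names, standing in for whatever token stream ends up being extracted from real code.

```clojure
(defn ngrams
  "All contiguous runs of n items from tokens."
  [n tokens]
  (partition n 1 tokens))

;; Hypothetical token stream: function names in call order.
(def tokens '[map filter map filter reduce])

(def bigram-counts (frequencies (ngrams 2 tokens)))

bigram-counts
;; => {(map filter) 2, (filter map) 1, (filter reduce) 1}
```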

Namespace Alias usage

It's common to use an alias when requiring a namespace. Since aliases are short, it's easy for multiple libraries to "claim" the same alias. I'd like to build a database of all namespace alias usage so that it's easier to pick a preferred alias for your library that is less likely to conflict. This is probably a dumb idea, but at least it's an easy dumb idea.
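A first pass could look something like the sketch below, assuming entries shaped like clj-kondo's :namespace-usages with :to and an optional :alias. The sample data and function names are made up.

```clojure
;; Hand-written sample shaped like clj-kondo :namespace-usages entries.
(def sample-ns-usages
  [{:from 'app.core :to 'clojure.string     :alias 'str}
   {:from 'app.util :to 'clojure.string     :alias 'str}
   {:from 'lib.a    :to 'clojure.set        :alias 's}
   {:from 'lib.b    :to 'clojure.spec.alpha :alias 's}
   {:from 'lib.c    :to 'clojure.java.io}]) ; required without an alias

(defn alias->namespaces
  "Map of alias to the set of namespaces it has been used for."
  [ns-usages]
  (reduce (fn [m {a :alias target :to}]
            (if a
              (update m a (fnil conj #{}) target)
              m))
          {}
          ns-usages))

(defn contested-aliases
  "Aliases that point at more than one namespace."
  [ns-usages]
  (into {}
        (filter (fn [[_ namespaces]] (> (count namespaces) 1)))
        (alias->namespaces ns-usages)))

(contested-aliases sample-ns-usages)
;; => {s #{clojure.set clojure.spec.alpha}}
```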

Related Work

Footnotes

1. Nine projects were excluded because they are ginormous by themselves.
2. Clj-kondo also allows you to write your own linters, but that will have to wait for a future update.
4. Here's a blog post explaining more about iteration. https://www.juxt.pro/blog/new-clojure-iteration
5. CSS is hard
6. Clearly