Posted: September 6th, 2022
Update: September 7th, 2022. Added Failed Analyses section and fixed total repository count.
See source for full history.
I'm generally interested in tools like cljdoc that work at the ecosystem level. As part of my work on dewey, which builds an index of all clojure libraries on github, I thought it would be a straightforward extension to statically analyze all the projects found. A bare, shallow checkout of every clojure project found by dewey[1] is only about 14GB, which is a very tractable size.
There are also a couple of features of the clojure ecosystem that make it an interesting target for wholesale analysis:
libraries address a wide range of problems
libraries tend to be stable and concise
clojure syntax is regular and minimal
clojure has strong support and conventions for namespaces
There are several tools available for analysis. Projects like cljdoc and getclojure use dynamic analysis, but for this initial run, I chose to use clj-kondo's static analysis. Given a project, clj-kondo will report[2]:
namespace definitions and usages
locals and usages
var definitions and usages
keywords
protocol implementations
Analysis Rate: 87%
Out of the 13,274 repositories found by dewey on github, 11,573 (87%) were successfully[3] analyzed.
The main goal of this project is to make the resulting analyses available for other projects and tools to consume. The full analysis of projects can be found under the dewey releases in the analysis.edn.gz file.
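As a rough sketch of how a consumer might load a file like this, the following reads a gzipped edn file using only clojure.edn and java.util.zip. The helper name read-edn-gz is hypothetical, and it assumes the file holds a single top-level edn form with no custom tagged literals. The real layout of analysis.edn.gz may differ.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '(java.io PushbackReader)
        '(java.util.zip GZIPInputStream))

;; Hypothetical helper: returns the first top-level edn form
;; found in the gzipped file at `path`.
(defn read-edn-gz [path]
  (with-open [rdr (-> (io/input-stream path)
                      (GZIPInputStream.)
                      (io/reader)
                      (PushbackReader.))]
    (edn/read rdr)))
```

Calling (read-edn-gz "analysis.edn.gz") would pull the whole form into memory, so for a large file a streaming approach is probably a better fit.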
In addition to providing the data, I also wanted to do some cursory investigation based on the analyses which I present below.
The clojure.core library is rich and I always seem to find useful nuggets that I've somehow missed. Just to make sure that I'm not the only person with clojure.core FOMO, I thought I'd share some hidden gems found among the least used public clojure.core vars. To be fair, many of these functions are recent additions, but not all of them!
I'll often use pmap for quick and dirty parallel computation. pvalues (15 usages) and pcalls (19 usages) also seem useful for quick and dirty parallelization.
(pvalues & exprs): Returns a lazy sequence of the values of the exprs, which are evaluated in parallel.
(pcalls & fns): Executes the no-arg fns in parallel, returning a lazy sequence of their values.
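A minimal sketch of both, with made-up workloads standing in for real computation:

```clojure
;; pvalues takes expressions and evaluates them in parallel.
(def sums
  (pvalues (reduce + (range 10))
           (reduce + (range 100))
           (reduce + (range 1000))))

;; pcalls takes zero-argument functions instead of expressions.
(def calls
  (pcalls #(count "dewey")
          #(apply str (reverse "abc"))))

(println (vec sums))  ;; prints [45 4950 499500]
(println (vec calls)) ;; prints [5 cba]
```

Results come back in argument order. Both are built on pmap, which runs on futures, so a standalone script may want to call (shutdown-agents) at the end to let the JVM exit promptly.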
The core library added a handful of useful parse-* functions in 1.11 like parse-double (26 usages), parse-boolean (6 usages), parse-long (202 usages), parse-uuid (25 usages). They do basically what you would expect. I usually search for the proper java interop call or go the lazy route and just use read-string. I'm happy to have a better alternative now.
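A quick sketch of how they behave (this needs Clojure 1.11 or later). The parse-port helper is hypothetical and only there to show a typical use:

```clojure
;; Each function takes a string and returns the parsed value.
(def parsed
  [(parse-long "42")
   (parse-double "3.14")
   (parse-boolean "true")])
;; parsed is [42 3.14 true]

(def id (parse-uuid "b6883c0a-0342-4007-9966-bc2dfed6b099"))

;; Input that doesn't parse yields nil rather than an exception.
(def rejected
  [(parse-long "4.2")
   (parse-boolean "yes")
   (parse-uuid "not-a-uuid")])
;; rejected is [nil nil nil]

;; Hypothetical helper: fall back to a default when the value is
;; missing or malformed. some-> guards against nil, since the
;; parse-* functions throw on non-string input.
(defn parse-port [s default]
  (or (some-> s parse-long) default))
```

Unlike read-string, these only ever return the type you asked for (or nil), which makes them a safer fit for untrusted input.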
Another gem is iteration (11 usages, once by dewey itself!) which was added in 1.11. The doc string doesn't immediately illuminate why or how to use it, but it's great. Check it out![4]
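Here's a small sketch of the kind of problem iteration is good at: walking a paginated source until it runs out of pages. The pages map and fetch-page function are made up stand-ins for a real API call, not how dewey uses it.

```clojure
;; Fake paginated source: each page has :items and, except for
;; the last page, a :next token pointing at the following page.
(def pages
  {0 {:items [:a :b] :next 1}
   1 {:items [:c]    :next 2}
   2 {:items [:d :e]}})

;; Stand-in for an HTTP call that takes a page token.
(defn fetch-page [token]
  (get pages token))

(def all-items
  (into []
        cat
        (iteration fetch-page
                   :initk 0      ;; token for the first call
                   :vf :items    ;; what to keep from each response
                   :kf :next)))  ;; how to find the next token
;; all-items is [:a :b :c :d :e]
```

Iteration stops when :kf returns nil, so the last page simply omits :next.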
Two functions that I think are criminally underused are min-key (256 usages) and max-key (431 usages). They tend to get overshadowed by their much more popular cousin, sort-by (4,926 usages), but are still useful in their own right.
(min-key k x) (min-key k x y) (min-key k x y & more)
Returns the x for which (k x), a number, is least.
If there are multiple such xs, the last one is returned.
max-key is the same, but different.
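A minimal sketch with made-up data, showing where min-key and max-key can stand in for a sort-by followed by first or last:

```clojure
;; Made-up sample data.
(def repos
  [{:name "a" :stars 10}
   {:name "b" :stars 3}
   {:name "c" :stars 42}])

(def least (apply min-key :stars repos))
;; least is {:name "b", :stars 3}

(def most (apply max-key :stars repos))
;; most is {:name "c", :stars 42}

;; The sort-by version gives the same answer but sorts
;; the whole collection just to pick one element.
(def least-by-sort (first (sort-by :stars repos)))
```

One caveat: applying either function to an empty collection throws an arity exception, so guard for that if the input might be empty.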
random-uuid (60 usages) is another 1.11 addition and a welcome one.
There's always a delay between new clojure versions and projects getting around to adopting and leveraging the latest additions. I was curious which of the latest features were gaining traction.
* seq-to-map-for-destructuring[5]
The results match my expectation that atom is, by far, the most common reference type. It was interesting to see that ref usage is higher than volatile!, but volatile! is much newer. I'm not surprised that agent is the least used reference type.
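For anyone who hasn't touched the less common ones, here is a minimal side-by-side sketch of the four reference types. The names and values are made up.

```clojure
;; atom: uncoordinated, synchronous updates.
(def counter (atom 0))
(swap! counter inc)

;; ref: coordinated updates inside a transaction.
(def account (ref 100))
(dosync (alter account - 30))

;; volatile!: a lightweight mutable box without atom's
;; compare-and-set guarantees.
(def scratch (volatile! 0))
(vswap! scratch + 5)

;; agent: asynchronous updates applied on another thread.
(def log (agent []))
(send log conj :started)
(await log) ;; block until the sent action has run

(println @counter @account @scratch @log)
;; prints 1 70 5 [:started]
```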
Average mutable reference usage per repository: 1.94
Repositories with no mutable reference usages: 7,245 (63%)
Truly bananas. Clojure libraries really do have less state. It probably isn't surprising if you've been programming in clojure for any length of time, but it's pretty wild to see the data back this up. I don't think I would have believed this 10 years ago.
total var definitions: 1,228,404
Most var definitions are functions. No surprise. I am sort of surprised that over 20% of vars are defs. What are people def'ing all over the place? That's something I would like to follow up on at some point.
I consider defprotocol to be an important, fundamental building block, so to see that defmacro usage (2.93%) is similar to defprotocol usage (3.84%) is interesting.
These numbers don't directly match the latest state of clojure numbers (i.e. over 50% deps.edn usage), but I don't think they're inconsistent since clojure repositories on github will be weighted towards older projects.
The "neither" category means that no deps.edn or project.clj file was found. Many of them were shadow-cljs projects. Detecting other build tools would be a good improvement in the future.
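A rough sketch of what that detection could look like, given the file names at the root of a repository. The function name, the category keywords, and the :both case are my assumptions for illustration, not what dewey currently does. shadow-cljs.edn is the usual shadow-cljs config file name.

```clojure
;; Hypothetical classifier: takes the file names found at the
;; root of a repository and returns a build tool category.
(defn build-tool [filenames]
  (let [files (set filenames)
        deps? (contains? files "deps.edn")
        lein? (contains? files "project.clj")]
    (cond
      (and deps? lein?)                   :both
      deps?                               :deps
      lein?                               :lein
      (contains? files "shadow-cljs.edn") :shadow-cljs
      :else                               :neither)))

(build-tool ["README.md" "shadow-cljs.edn" "package.json"])
;; returns :shadow-cljs
```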
Clojurians like using short local variable names and this proves it!
The most common local name is _ with 155,951 followed by this at 92,296. I'm not sure what it says that 4% of locals are throwaways. What a waste.
I couldn't find an official reference for what single character symbols are supported by clojure, but the edn spec states that:
Symbols begin with a non-numeric character and can contain alphanumeric characters and
. * + ! - _ ? $ % & = < >
I like the idea of using Greek letters as locals, but I find using | or ! as locals offensive.
Additionally, : and # are allowed as constituent characters in symbols other than as the first character.
Does anybody do that? Well, there's only a single local that contains # in the name other than at the very end (which is used by syntax quote), but 21 repositories use locals with : in the name. It's not something I've thought a lot about before, but now that I know that I can...
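For the curious, a tiny sketch of what that looks like. The names are made up, and I'm not recommending it:

```clojure
;; Locals with : and # in the middle of the name.
(def total
  (let [a:b 1
        c#d 2]
    (+ a:b c#d)))
;; total is 3
```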
It's not clear if using emoji and unicode as var names is officially supported, but it's also not clear if it's not not supported[6]. I mean, it works when you type it into the repl:
user> (def 😊 42)
#'user/😊
user> (+ 😊 😊)
84
I'm not saying it's a good idea. Anyway, there are 25 brave repositories that use emoji/unicode in var definition names. There is a single namespace that includes unicode.
Usages range from questionable to ╯°□°╯ ┻━┻.
For the 13% of repositories that aren't included, what went wrong?
In the future, I'd like to do a better job of breaking down what errors happened during analysis, but here's the coarse breakdown for now.
Descriptions:
:empty-analyses: No analysis was produced. The most common (only?) reason is that no project files were found (only deps.edn and project.clj are supported).
:no-findings: Analysis succeeded without error, but produced no findings. In theory, this could mean that everything worked and there was just nothing to analyze, but I'm assuming that it's an error for now.
various Exceptions: Exceptions thrown while trying to analyze a project.
:max-bytes-limit-exceeded: Repository checkout was over the file size limit (currently 100 MB).
:download-error: An exception was thrown while trying to download the repository from github.
:read-error: Analyses for each repository were saved to intermediate files before being aggregated. This error means the resulting analysis file was unreadable.
I guess I wouldn't say the data is biased so much as unrefined. I tried simply looking at the "most popular X" among usages, but you get misleading results if you don't account for biases like:
popular libraries have more forks which amplify their usage patterns
outliers: a single or small number of repositories that do "weird" things can skew simple counts and averages
copy and pasting: for a number of reasons, shared code is sometimes added to a repository by copy and pasting the code into a library rather than referencing the code as a dependency.
example and test code: test/example code can often exhibit very different usage patterns than "normal" code.
failed analyses: about 87% of repositories were successfully analyzed. That's not terrible, but the 13% of libraries that failed to produce analysis probably aren't uncorrelated, which can introduce biases.
Most of these biases are pretty straightforward to account for and will hopefully be addressed in future work.
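As one example of a simple correction, counting the number of distinct repositories that use a var, rather than raw usages, dampens the effect of a single repository that uses something hundreds of times. This is only a sketch: the record shape with :repo and :var keys is hypothetical and not the actual analysis format.

```clojure
;; Hypothetical usage records, one per call site.
(def usages
  [{:repo "alice/app"    :var 'clojure.core/pmap}
   {:repo "alice/app"    :var 'clojure.core/pmap}
   {:repo "bob/app-fork" :var 'clojure.core/pmap}
   {:repo "carol/lib"    :var 'clojure.core/pcalls}])

;; Raw count: every call site counts.
(defn raw-counts [usages]
  (frequencies (map :var usages)))

;; Per-repository count: each repository counts once per var.
(defn repo-counts [usages]
  (->> usages
       (map (juxt :var :repo))
       distinct
       (map first)
       frequencies))

(raw-counts usages)
;; returns {clojure.core/pmap 3, clojure.core/pcalls 1}
(repo-counts usages)
;; returns {clojure.core/pmap 2, clojure.core/pcalls 1}
```

This doesn't handle forks. That would need fork metadata so that a fork and its parent can be collapsed into one entry.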
Most language ecosystems have figured out that directly supporting community repositories/tools like clojars is a big win. Based on some smart design choices by clojure, its maintainers, and the community, I think that clojure is well positioned to take advantage of tools that work at the ecosystem level. I'm not totally sure what other ecosystem tools look like, but I have some ideas.
While the vision is to work on innovative ecosystem tooling, many of the planned improvements are mundane:
crawl other open-source hosts besides github
improve support for clojurescript and other clojure dialects
automate more of the data gathering and analysis
Currently, the full analysis of every clojure repository is made available as a giant 600+ MB gzipped edn file. It would be great to make the data available in a more organized, structured, and queryable format (like a database).
Examples for clojure functions can be illustrative. It should be possible to pull up examples for any function.
Github search is bad. grep.app is better, but a search tool that only targets clojure code could be much, much better!
One of the common roadblocks for learning clojure/clojurescript is that you eventually need some amount of host interop to be productive. Finding commonly used java/javascript host interop can provide valuable input for improving learning resources with common interop knowledge or creating wrapper libraries for frequently required host APIs.
Afaict, specs are designed with a global, universal orientation in mind. We should build tooling that leverages and supports this. As a simple example, building a searchable database of all specs should be straightforward.
In the fields of computational linguistics and probability, an n-gram (sometimes also called Q-gram) is a contiguous sequence of n items from a given sample of text or speech.
I'm not exactly sure what I'd expect to find, but it would be interesting to find common patterns and idioms in clojure code.
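A first pass could be as small as this sketch, which counts n-grams over a flat sequence of symbols. The call-sequence data is made up. Deciding what the "items" should be for real clojure code (tokens, call heads, whole forms) is the open question.

```clojure
;; All contiguous runs of n items, sliding by one.
(defn ngrams [n items]
  (partition n 1 items))

;; Made-up sequence of function names in call order.
(def call-sequence '[map filter map reduce map filter])

(def bigram-counts
  (frequencies (ngrams 2 call-sequence)))
;; (map filter) appears twice, every other bigram once
```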
It's common to use an alias when requiring a namespace. Since aliases are short, it's easy for multiple libraries to "claim" the same alias. I'd like to build a database of all namespace alias usage so that it's easier to pick a preferred alias for your library that is less likely to conflict. This is probably a dumb idea, but at least it's an easy dumb idea.
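A sketch of the core of that idea: collect which namespaces each alias points to and flag aliases claimed by more than one. The sample data is made up. The :alias and :to keys mirror what I believe clj-kondo reports for namespace usages, but treat that shape as an assumption.

```clojure
;; Made-up namespace usage records.
(def ns-usages
  [{:alias 'str :to 'clojure.string}
   {:alias 'str :to 'clojure.string}
   {:alias 's   :to 'clojure.string}
   {:alias 's   :to 'clojure.spec.alpha}
   {:alias 's   :to 'clojure.set}])

;; Map of alias -> set of namespaces it has been used for.
(defn alias-targets [usages]
  (reduce (fn [acc {:keys [alias to]}]
            (update acc alias (fnil conj #{}) to))
          {}
          usages))

;; Only the aliases that point at more than one namespace.
(defn contested-aliases [usages]
  (into {}
        (filter (fn [[_ targets]] (> (count targets) 1)))
        (alias-targets usages)))

(contested-aliases ns-usages)
;; returns a map with the single key s, whose value is the set of
;; clojure.string, clojure.spec.alpha, and clojure.set
```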
[4] iteration: https://www.juxt.pro/blog/new-clojure-iteration