Let's do what we always do... let's export the data from our language server into a SQL database. After all, the Language Server has all of this information, but its query language is a tedious JSON-RPC situation.
This post was an excuse for me to learn a bit more about language servers; come along for the ride!
Much of our mutual drudgery is slurping data from one end of our systems to another through a tiny, awkwardly-shaped straw. Some examples, to wit:
Data | Straw | Query System | Alternative |
---|---|---|---|
Linux Process Information | /proc file system |
ps |
osquery |
Open Files, Ports | /proc on Linux |
lsof ... |
grep |
AWS Instance Metadata | DescribeInstances API |
aws CLI with --filters flags |
IASql? |
File System Metadata | listdir() , stat |
find |
fselect ? |
Git | git log |
git log |
mergestat-lite ? |
Type Information about your Code | Language Servers | VSCode, compiler | 🌟This blog post!🌟 |
Compared to OS X's infrastructure for listing processes,
the /proc
file system is lovely; we have tools (like grep
) for querying file systems!
Let me know if you've got items to add to my handy table.
Language Server History #
Microsoft, with Visual Studio Code, led the way in standardizing Language Servers. They describe their history on github. The specification is at https://microsoft.github.io/language-server-protocol/.
With a nod to XKCD 927, VSCode doesn't actually
use a Typescript Language Server! It uses a "TypeScript Server" (tsserver
).
The Language Server protocol evolved as a generalization, but
the migration hasn't
happened. For this project, I used one of the wrappers that wraps tsserver
in a language server.
Language Server Protocol #
Language servers typically run as a separate process, and communicate with their parent via stdin/stdout pipes. This makes them behind the scenes, as they're managed by your IDE, but it means it's tricky to see their logs, and, at least for me, it made it a bit tricky to attach a debugger to the language server.
Using stdin/stdout, and needing to "read lines" to parse the content-length exposes you to some fun with Python's bytes versus strings, UTF8 encoding, blocking reads and writes, etc. I'm sure my script breaks if you're not using UTF-8, but surely you're using UTF8 everywhere.
The protocol pushes a Content-Length
line, \r\n\r\n
newlines, and then a
JSON blob of content. This happens bi-directionally. There are notifications
(which go one way and don't expect a response) and requests (which expect a
response). The
specification
has a list of methods. A reasonable person would use a pre-existing client
library with typed support. This blog post was not written reasonably.
The protocol is stateful. You pretend to be an editor and say that you "didOpen" a document, and then you can ask for hover annotations.
Often, you can script APIs by inspecting what they do in the Network tab
and using "Copy as curl" and scripting with bash
. This is not easy to do
for language servers. Sometimes CLI tools for the API are the way to go,
but I didn't find much (though maybe should have tried
lsp-cli).
Having said that, I nerd-sniped myself, and here's querying language servers in
bash
. Note how counting content-length is a pain, but we have printf
to pad
our JSON for us. The sleep
is insidious: the language server is happy to exit
whenever its input stream is closed, and Node sys.stdout.write()
is
asynchronous. (See this
note: "Writes may be
synchronous depending on what the stream is connected to"... Or also this
investigation.)
Let me know if I've nerd-sniped you, and you figured out how to get rid of
the sleep without resorting to reasonable approaches like python
.
{
printf "Content-Length: 1000\r\n\r\n%1000s" '{"jsonrpc":"2.0","method":"initialize","id":1,"params":{"capabilities":[],"processId":'$$'}}'
printf "Content-Length: 10000\r\n\r\n%10000s" '{"jsonrpc":"2.0","method":"textDocument/didOpen","params":{"textDocument":{"uri":"file:///Users/philip/src/hono/src/utils/url.ts","languageId":"typescript","version":1,"text":'"$(cat /Users/philip/src/hono/src/utils/url.ts | jq -Rs .)"'}}}'
printf "Content-Length: 1000\r\n\r\n%1000s" '{"jsonrpc":"2.0","method":"textDocument/documentSymbol","id":3,"params":{"textDocument":{"uri":"file:///Users/philip/src/hono/src/utils/url.ts"}}}'
sleep 2
} | foo/node_modules/.bin/typescript-language-server --stdio | grep '"id":3,' | jq | head -n 10
{
"jsonrpc": "2.0",
"id": 3,
"result": [
{
"name": "_decodeURI",
"kind": 14,
"location": {
"uri": "file:///Users/philip/src/hono/src/utils/url.ts",
"range": {
🥁🥁 Here's the same data, queried sensibly:
$duckdb language-server-db/sample/hono.db --line "
SELECT *
FROM symbols
WHERE uri LIKE '%url.ts'
ORDER BY symbol_start_line DESC
LIMIT 1
"
uri = file:///Users/philip/src/hono/src//utils/url.ts
symbol_name = decodeURIComponent_
symbol_kind = Constant
symbol_start_line = 309
symbol_start_character = 13
symbol_end_line = 309
symbol_end_character = 53
symbol_container_name =
Visualization #
So, how should we look at this data? Probably the editors, with their "inlay hints" and their mouse-overs are doing it right, but maybe like an annotated ("glossed") medieval manuscript? Maybe like my student's copy of The Aeneid, with lots of hard to decipher vocabulary notes?
This is a long-winded blog post, so we also get the opportunity to quote Bret Victor:
We expect programmers to write code that manipulates variables, without ever seeing the values of those variables. We expect readers to understand code that manipulates variables, without ever seeing the values of the variables. The entire purpose of code is to manipulate data, and we never see the data. We write with blindfolds, and we read by playing pretend with data-phantoms in our imaginations.
One of the all-time most popular programming models is the spreadsheet. A spreadsheet is the dual of a conventional programming language -- a language shows all the code, but hides the data. A spreadsheet shows all the data, but hides the code. Some people believe that spreadsheets are popular because of their two-dimensional grid, but that's a minor factor. Spreadsheets rule because they show the data.
It's a stretch, but, in this case, the hidden data is the types the compiler knows but are hidden to us as the readers.
And, so, I fed hono into it. (Hono is like Express, but it's the default web framework for val.town...)
Source #
https://philz.github.io/language-server-db/ and https://github.com/philz/language-server-db have the Python script, a full page of sample output, etc.