On the Shave Architecture
Accompanying yesterday’s post announcing Shave, I thought it might be interesting for some of you to learn a bit more about what is going on under the hood of the tool.
It will come as no surprise that it started out life as a clone of Dave Coffin’s Python transliterator. That tool was the best one available when I started out learning Shavian. But I found it to be imperfect. My biggest complaint at the time was that it didn’t preserve HTML markup very well, which made for very difficult reading when you fed it a typical modern website.
I “forked” it (there is no repo for Dechifro’s tool, unfortunately), and iterated on his original code for my own personal use. While I managed to fix the layout issues sufficiently early on, I was less fortunate with adapting the code to do a better job of heteronym disambiguation, or trying to fix some of the stranger spelling errors the morphological rules were producing. The final nails in that coffin were the inability to run interpreted code on iOS¹, and the atrocious performance I was seeing with that stack, particularly when I transliterated large batches of small documents.
So I set about building a new stack from scratch, this time in Swift², so it would be performant and would run natively on any device that I care about³ (sorry, Windows users). I built it around Dave’s dictionary files, and topped it up with the Readlex lexicon. With the fresh start, I was able to build a markup-preserving pipeline from the ground up.
The pipeline this evolved into breaks the work up as follows:
Split the input into blocks (block-level elements such as <p> or <div> in HTML, paragraphs in plain text), and for each of those:
- Tokenize the input. Any recognized markup (basically: HTML tags) gets set aside, remembering the offsets in the stripped token stream where it came from.⁴
- Pass the now plain-text tokens into the pipeline.
- Detect sentence boundaries, and identify and normalize punctuation (quotes vs. apostrophes, which makes a big difference for the next stages).
- Run a UDPipe part-of-speech (POS) tagger over the block.
- For each (word token, POS) pair, try a direct lookup in the dictionary. The result is a list of candidates with probabilities, governed by the source of the word (Readlex? The supplement? The user dictionary?) and word frequencies (see the sketch after this list).
- If that fails (the word wasn’t in the dictionary), we fall back to the morphological engine. It attempts to break the word up into stems and affixes and glue them together using Shavian spelling conventions (e.g., inserting ·𐑩 or ·𐑦, or turning a Latin ‘-es’ into ‘-𐑩𐑟’ or ‘-𐑩𐑕’ depending on context). This also results in a list of candidates with confidence levels.⁵
- Disambiguation: at this stage in the pipeline we have one or more candidates for each word token in our block. Truly ambiguous ones (“Read read the book”, “we saw the band live”) get passed to my dedicated heteronym disambiguation ML model (word-sense disambiguator, or WSD), which adjusts the confidence levels for each candidate.
- Another round of disambiguation comes from figuring out which words were proper nouns and which words were just capitalized because they were the first word in the sentence. I still haven’t fully cracked this one, but it’s getting better: the POS tagger, knowledge of which words are usually proper nouns, and some other heuristics do an okay job here.
- For any ambiguities that have not been resolved to the desired confidence level, fire off an event with all the candidates and any relevant metadata. The client (the web UI or the CLI) then decides what to do, with the user’s help if it’s running interactively.
- Finally, reconstruct the markup for the block that we stripped out at the beginning: figure out where the token boundaries are in the output stream, and insert the tags there.
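To make the dictionary-lookup and morphology steps above a little more concrete, here is a minimal sketch of the kind of candidate model that flows between stages. The names here (Candidate, candidates(for:pos:), morphologicalGuess) are mine for illustration; they are not the actual Shave API.

```swift
// Illustrative sketch, not the real Shave types: a candidate spelling with a
// confidence that later stages (WSD, proper-noun heuristics) can adjust.
struct Candidate {
    let shavian: String      // proposed Shavian spelling
    var confidence: Double   // adjusted by the later disambiguation stages
    let source: Source       // where the spelling came from
    enum Source { case readlex, supplement, userDictionary, morphology }
}

/// Resolve one (word token, POS) pair to a ranked list of candidates:
/// dictionary lookup first, the morphological engine as the fallback.
func candidates(for token: String, pos: String,
                dictionary: [String: [Candidate]],
                morphologicalGuess: (String, String) -> [Candidate]) -> [Candidate] {
    if let hits = dictionary["\(token.lowercased())|\(pos)"], !hits.isEmpty {
        return hits.sorted { $0.confidence > $1.confidence }
    }
    // Unknown word: let the morphology engine assemble stems and affixes
    // and score the resulting guesses itself.
    return morphologicalGuess(token, pos)
}
```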
That’s the forward (Latin to Shavian script) mode. The reverse mode is almost identical, though the implementation details differ (it uses a strategy pattern). The main differences are:
- I don’t have an off-the-shelf POS tagger for Shavian, sadly! So no POS tagging at the beginning.
- I have not implemented anywhere near as many morphological rules yet, so I sort of brute-force the lookup. I have created supplementary Readlex entries by pumping a large corpus of words (with POS info etc.) through the forward transliterator, with some common spelling variations thrown in the mix too (trap/bath, cot/caught, etc.). This works remarkably well; it extends the vocabulary by a factor of 4 or 5, bringing the number of unknown words way down for most prose I’ve tried it on.
- Morphology: so far I’ve implemented plurals and possessives. I tried expanding this set of rules substantially, but the results were not yet as good as the brute force lookup, so I left it out of the first release.
- Disambiguation: this is where the reverse mode gets really hairy. It turns out that this is very hard! There are many more homographs in Shavian than in traditional orthography, unfortunately. The tricks I employ at the time of writing are:
- Bigram disambiguation, trained on the worst offenders (𐑲: I, eye, or aye? 𐑞𐑺: there, their, or they’re?). This one does a pretty good job, actually (see the sketch after this list).
- POS tagging. This is again a bit brute-forcy – the combinatorics are not in our favor, but if we fill a particular candidate in and pass it to the POS tagger, does it make any grammatical sense?
- Finishing: sentence-initial capitalization. Proper nouns are only tricky for the single-namer-dot convention (not to mention just not using them at all, as Kingsley Read advocated for unambiguous cases).
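To give a flavour of the bigram trick mentioned above, here is a toy version of the general idea, not the trained model that actually ships with Shave: rank each possible Latin reading of an ambiguous Shavian token by how well it fits the word before it.

```swift
// Toy illustration of bigram disambiguation; the counts here are made up.
struct BigramModel {
    // counts["over|there"] = how often "over there" appeared in the training corpus
    let counts: [String: Int]

    func score(previous: String, reading: String) -> Double {
        let hits = counts["\(previous.lowercased())|\(reading.lowercased())"] ?? 0
        return Double(hits + 1)   // add-one smoothing so unseen pairs aren't zeroed out
    }

    /// Pick the most plausible reading for an ambiguous token such as 𐑞𐑺.
    func best(previous: String, readings: [String]) -> String? {
        readings.max { score(previous: previous, reading: $0) <
                       score(previous: previous, reading: $1) }
    }
}

let model = BigramModel(counts: ["over|there": 40, "over|their": 2])
print(model.best(previous: "over", readings: ["there", "their", "they're"]) ?? "?")
// prints "there"
```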
The library has two front ends of note: a CLI (which I will release to the world soon enough), and the web UI that I released yesterday.
The web app architecture built around the library is quite straightforward: a Swift binary called shaved runs on my Ubuntu server and communicates with the client via SSE (server-sent events). The server process’s wiring is entirely stateless; the only server-side state is the files produced by the transliteration.
For the e-book UI, I felt that most users would not really care for having to create accounts, so I opted for a fairly lightweight session management solution: a unique client ID that is cached in the browser’s local storage. The client hands the server this key to find all ebooks that were created with that key.
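On the server side, that key does little more than pick out a directory of finished files. A hypothetical sketch of the mapping (the function name and the validation rule are mine, not shaved’s):

```swift
import Foundation

// Hypothetical sketch: map the client ID held in the browser's local storage
// to a per-client directory of transliterated ebooks on the server.
func ebookDirectory(for clientID: String, under root: URL) -> URL? {
    // Never splice an untrusted ID straight into a path; insist on a simple token.
    guard clientID.range(of: "^[A-Za-z0-9-]{8,64}$", options: .regularExpression) != nil
    else { return nil }
    return root.appendingPathComponent(clientID, isDirectory: true)
}
```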
The ebook transliteration runs asynchronously, and that’s where the SSE protocol comes into play: as each chapter finishes transliterating, the server sends its ‘events’ (that is, the lists of remaining ambiguities or unknown words) to the UI. When the user changes the resolution of that event, the browser sends a ‘patch’ back to the server. The chapter is re-transliterated, and the updated events for it are sent back again, replacing the previous set.
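Roughly, the two payloads going back and forth have shapes like the following. These structs are only an illustration of the idea; the actual wire format shaved uses may differ.

```swift
// Illustrative shapes for the event/patch round trip, not the real wire format.
struct AmbiguityEvent: Codable {
    let chapter: Int
    let tokenOffset: Int        // where in the chapter the ambiguous token sits
    let original: String        // the token as it appeared in the source text
    let candidates: [String]    // possible transliterations, best guess first
}

struct ResolutionPatch: Codable {
    let chapter: Int
    let tokenOffset: Int
    let chosen: String          // the candidate the user picked in the correction UI
}
```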
To prevent the server from getting slammed too heavily, it maintains two separate queues for jobs: one for quick jobs and one for the heavier ones. Full ebook transliterations are assumed to be pretty heavy, and so are restricted to at most two running concurrently. The user gets feedback if his or her job is queued. (Quick jobs just seem… slow.)
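One straightforward way to get that behaviour, sketched here with Foundation’s OperationQueue (the real implementation inside shaved may well differ), is to cap the heavy queue at two concurrent operations and leave the quick queue alone:

```swift
import Foundation

// Minimal sketch of the two-queue setup; not the actual shaved code.
let quickQueue = OperationQueue()
quickQueue.name = "quick jobs"                 // single documents, interactive requests

let heavyQueue = OperationQueue()
heavyQueue.name = "ebook jobs"
heavyQueue.maxConcurrentOperationCount = 2     // at most two full ebooks at once

func enqueue(isHeavy: Bool, job: @escaping () -> Void) {
    (isHeavy ? heavyQueue : quickQueue).addOperation(job)
}
```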
I had a blast getting the UI with the dual texts and the correction widgets to flow smoothly and intuitively. I’m quite pleased with how the synchronized scrolling and the visual feedback of the correction workflow panned out.
Next steps I may or may not pick up in the near future:
- Custom font choices / more ‘canned’ typefaces to choose from. I shipped my own Bernie Sans Beta with it, well, because. But there’s no particular reason why people shouldn’t be able to choose Doolittle, Ormin, Inter-Alia, Noto Sans or any other font they prefer.
- Collaborative editing. Right now you can share your ebooks (just copy the URL), but other users can only see your corrections and export the book. There’s no way for them to join in on the editing yet.
- Sharing e-books. It’d be nice to be able to ‘publish’ your corrected ebooks, and for other users to browse the growing corpus of Shavian literature in one convenient place.
- Engine/correction UI improvements:
- There are some issues around editing compound/hyphenated words that need fixing
- I would like to surface options for apostrophe styles (keep/only possessives/none) and trap/bath and cot/caught spelling variations.
- Continue training and improving the heuristics. Still too many wrong choices for some heteronyms.
- Build my virtual keyboard from Shaw-Type into the tool!
- Carry on expanding the dictionary! Indeed, there’s overlap with my Shaw-Spell project, and hopefully I’ll be able to give Evan a pull request with some of my additions soon.
- Maybe add some feedback functionality? Ability to flag specific errors and report them to me?
I most certainly cannot promise anything, but do let me know if there’s a feature you would like to see implemented.
– Joro
Footnotes:
- I created a Safari transliteration plugin pretty early on in this process; it had to be native if I ever wanted to upload it to the App Store. ↩︎
- I had never written Swift before, but had always wanted to learn the language. This project was an excellent vehicle for this. ↩︎
- Perhaps not very widely known, but Swift is in fact open source and has excellent support for Linux. The only thing you give up when making it cross-platform is some of Apple’s proprietary OS-specific frameworks. ↩︎
- This also includes HTML tags inside words, like in drop-cap initial letters of chapters. We do best-effort reconstruction on the other end; it usually works great. ↩︎
- I’m glossing over some interesting details here, including a divide-and-conquer strategy for compound (e.g. hyphenated) words and a number of other edge cases, such as Roman numerals and month- and weekday names, which don’t get namer dots in Shavian. ↩︎