TAKtical WHISPERing
At OL we find that understanding what is difficult about a problem is as useful as figuring out how to bring it to life. This is a story in which we stumbled into many, many of those difficulties while tackling what we think is a critical challenge for the robust adoption of generative capabilities into tactical environments. We wanted to share a warts'n'all talk-through of what we did so you can avoid some of the same pitfalls and learn from our battle scars (excuse the pun).
The challenge
As you may have seen, our mission is to 'unfold the unthinkable', though at times it appears we treat that more as a competition than a goal… in hindsight, this was one of those times.
Defence is a fascinating place for AI; it has so much to gain through its use, yet so much to lose if it goes wrong. This often means normal rules don't apply - tools used broadly in the commercial sector might be too non-deterministic, inaccurate or vulnerable to integrate properly into operational use. This has left us with a lasting itch: how can we overcome these limitations in SOTA capabilities? How can we enable Defence to make use of the latest and greatest in AI?
Over the years we have all been exposed (perhaps more than we would like) to how data moves around the battlespace. It has always been surprising that speech remains the primary modality at the tactical level, and moreover how much could be done with that voice data. If it were structured, there could be a single version of the truth, enabling automation, predictive analytics, human-machine interaction… the list goes on.
Inevitably these two areas collided, and we thought we would see if there was a way to get SOTA Speech-to-Text models to a level where they could dependably form a structured packet of data that could be exploited. We wanted to capture the data at source, so let's see how close to the tactical operator we could get - how about their radio? This means we are handling mission-critical tactical information, so the margin for error is pretty low. As anyone who has used their smartphone's voice assistant will attest, we didn't set ourselves the easiest task - hear bee grins the tail of our Miss adventures…
Getting representative data
Whilst choosing to encode at source has many benefits, it comes with huge drawbacks: a tactical user passing a message is likely to be under duress, speaking very quickly and surrounded by a fair amount of seemingly random loud noise - all of which is problematic for a low-SWaP device. Thankfully, to overcome some of these challenges and ensure the right information is passed even under high stress, tactical operators use tightly structured patterns for their messages. This removes some of the randomness and means our dataset generation does not need to cover an infinite number of word positions.
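To give a flavour of what that structure buys you (without reproducing any real voice procedure - the templates, callsigns and vocabularies below are purely illustrative), here is a minimal sketch of how a constrained message generator might look:

```python
import random

# Purely illustrative message template - NOT real voice procedure.
# Placeholders are filled from small, closed vocabularies, which is what
# keeps the dataset generation problem tractable.
CALLSIGNS = ["alpha one zero", "bravo two zero", "charlie three one"]
DIRECTIONS = ["north", "north east", "east", "south west"]
TARGETS = ["enemy tank", "enemy infantry", "enemy vehicle"]
DISTANCES = ["two hundred", "five hundred", "one thousand"]

TEMPLATE = (
    "hello {net} this is {sender} contact {target} "
    "{distance} metres {direction} over"
)

def generate_message(rng: random.Random) -> str:
    """Fill one template from the closed vocabularies above."""
    return TEMPLATE.format(
        net=rng.choice(CALLSIGNS),
        sender=rng.choice(CALLSIGNS),
        target=rng.choice(TARGETS),
        distance=rng.choice(DISTANCES),
        direction=rng.choice(DIRECTIONS),
    )

if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):
        print(generate_message(rng))
```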
So far this is difficult but business as usual; trying to get real data, however, sounded complex, difficult, and probably a bit dangerous. To get the message structure right, we had our in-house expert on all things Army dust off his old headset and engage his voice-procedure muscle memory - problem one solved. The problem of the random yet specific noise was really challenging, until a Friday evening watching Black Hawk Down sparked an idea. Before we knew it the whole team had the best scenes playing at full volume as they screamed messages into a headset, in what turned out to be a somewhat immersive experience, and undoubtedly concerning for our neighbours!
After pulling in a few separate voices at a few different intensities of noise, we had our baseline data and were ready to crack on.
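If you want to build a similar baseline, one simple way to control the "intensity of noise" is to mix clean recordings with a noise bed at a chosen signal-to-noise ratio. The sketch below is one way to do that, assuming mono WAV clips at the same sample rate; the file names are illustrative.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(speech_path: str, noise_path: str, out_path: str, snr_db: float) -> None:
    """Overlay noise on speech at a target signal-to-noise ratio (in dB). Assumes mono clips."""
    speech, sr = sf.read(speech_path)
    noise, sr_n = sf.read(noise_path)
    assert sr == sr_n, "expect matching sample rates"

    # Tile (or trim) the noise to the length of the speech clip.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))

    mixed = speech + scale * noise
    # Normalise to avoid clipping when writing back out.
    mixed /= max(1.0, np.abs(mixed).max())
    sf.write(out_path, mixed, sr)

# e.g. mix_at_snr("radio_msg.wav", "battle_noise.wav", "radio_msg_5db.wav", snr_db=5)
```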
Whispering correctly!
So first hurdle complete, onto the next…
We hadn't chosen the easiest dataset: loud random noise, fast speech and odd, unusual abbreviations can leave even the best Speech-to-Text models a bit hit and miss. In a tactical situation that is not a characteristic that goes down well, so we knew we might have to get creative.
We started by choosing our model. It is no surprise that we decided to use Whisper by OpenAI: it has a great selection of model sizes, high performance and a wealth of quantization research behind it. Because we didn't think the challenge was hard enough already, we accepted that an operator was unlikely to be lugging around an RTX 4090, so we should bound the compute to the hardware they would actually have - most likely a smartphone. As you can imagine, this wound down our performance options somewhat. We conducted inference tests and set an upper bound of 15 seconds per transcription. This left us with Whisper small.en as our model, although we were able to run a Medium model when using the faster-whisper implementations by SYSTRAN.
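For reference, our "inference test" boiled down to the sort of thing sketched below: time a CPU transcription and compare it against the budget. This uses the faster-whisper package; the clip name and settings are illustrative, and your numbers will obviously depend on the handset.

```python
import time
from faster_whisper import WhisperModel

BUDGET_SECONDS = 15  # our upper bound for a usable on-device transcription

def time_model(model_size: str, audio_path: str) -> float:
    """Transcribe one clip on CPU with int8 weights and return wall-clock time."""
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    start = time.perf_counter()
    segments, _info = model.transcribe(audio_path, beam_size=5)
    text = " ".join(seg.text for seg in segments)  # segments is a generator - consume it to do the work
    elapsed = time.perf_counter() - start
    print(f"{model_size}: {elapsed:.1f}s -> {text[:60]}...")
    return elapsed

for size in ("small.en", "medium.en"):
    if time_model(size, "sample_contact_report.wav") > BUDGET_SECONDS:
        print(f"{size} is over the {BUDGET_SECONDS}s budget on this hardware")
```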
We took two approaches which worked in concert with each other: one looked to symbolically adjust the core model output based on experience, the other was more conventional fine-tuning on our data. Fine-tuning the model in the middle of summer turned Tom's basement into a Scandinavian sauna, but it was worth it. Once complete, we achieved a Word Error Rate of 20% on even the noisiest samples. Even with this high error rate, the symbolic overlay gave an additional layer of validation which proved very helpful. Through all this, there were a few things we learnt which are worth being aware of if you decide to embark on a similar adventure:
1) Bigger models are not always better. We had quite a limited vocabulary compared with general transcription, and when we used the Medium models we actually found they often performed worse than Small. There are many possible reasons for this (and Large outperformed them both), but it is worth bearing in mind: don't assume the biggest model you can run will be the most performant.
2) Watch for backdoors. While walking through some of the more research-oriented speech-to-text libraries, we found code that pulled from an open AWS S3 download link. It was innocent, designed as a simple way to distribute a set of segmentation model weights. But if you're a fellow avid Darknet Diaries listener (*other cyber security podcasts are available) this will no doubt have raised your hackles. It certainly did for us, and it was a red line. We had two choices: redesign the code or avoid the library altogether. Sadly, its transcription function was pretty crucial, so rebuild it was! The lesson is that you really should scrutinise your packages - don't assume that because it is on PyPI it is safe!
3) Word-level transcription. Getting reliable timings for individual words is hard. Whilst some of the larger models come with this functionality, it can be a bit ropey. Some of the research-based libraries are useful, but they can be really sensitive: because they work by effectively segmenting the mel spectrogram, a high level of noise can really upset things. That said, we found some to be workable - a rough sketch of what we mean is below - just expect a margin of error in any data that represents reality.
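As an illustration of point 3, faster-whisper exposes per-word timings via word_timestamps=True. The clip name below is illustrative, and as noted above the alignments can drift badly under heavy noise, so treat them as indicative rather than ground truth.

```python
from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cpu", compute_type="int8")

# word_timestamps=True asks for per-word start/end times; under heavy noise
# these alignments can drift, so treat them as indicative only.
segments, _info = model.transcribe("noisy_contact_report.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words or []:
        print(f"{word.start:6.2f}s - {word.end:6.2f}s  {word.word}  (p={word.probability:.2f})")
```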
Text into data
Now that we had our transcribed text, we needed to turn it into structured data. In our chosen challenge this is one area where we were a little fortunate. Because of the way the messages are formatted - largely because human ears suffer the same challenges with high-noise messages - there is a tidy comms protocol and really only a small subset of words that we care about. It is highly unlikely that the word 'daffodil' is going to appear in the middle of a firefight, though we could always be proved wrong!
This meant we could trim the vocabulary we were looking to extract and add simple logic to search for the right words given the context. One challenge you face here is homophones: transcriptions that sound right and may even be valid words. This is fine for plain transcription, but when you are looking to extract specific terms it can become a nightmare, or a night-mare, or maybe a knight mare. Figuring out how to find the right word(s) and punctuation was really challenging, taking a hefty combination of logic and probability to solve! Even in a simple domain this component should not be underestimated, and it is likely where a solid 40% of your time will be spent.
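A heavily simplified sketch of the kind of vocabulary snapping we mean is below. The vocabulary, homophone table and fuzzy-match threshold are all illustrative, and the real thing leant on far more context and probability than this:

```python
import difflib

# The small vocabulary we actually care about extracting (illustrative).
VOCAB = {"contact", "tank", "infantry", "vehicle", "metres", "north",
         "south", "east", "west", "over", "out", "wait"}

# Known homophones / mis-hearings mapped back to vocabulary terms (illustrative).
HOMOPHONES = {
    "tanks": "tank",
    "meters": "metres",
    "overt": "over",
    "waits": "wait",
}

def snap_to_vocab(token: str) -> str | None:
    """Map a transcribed token onto the constrained vocabulary, if it plausibly fits."""
    token = HOMOPHONES.get(token.lower(), token.lower())
    if token in VOCAB:
        return token
    # Fall back to fuzzy matching for near-misses ("contacked" -> "contact").
    close = difflib.get_close_matches(token, sorted(VOCAB), n=1, cutoff=0.8)
    return close[0] if close else None

print([snap_to_vocab(t) for t in "contacked enemy tanks five hundred meters north".split()])
```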
Running on a Smartphone
Ok, now we are getting to the grisly bit. We wanted to get this to run on a mobile, as it was likely to be the only device with some element of compute that we could count on; the flexibility of Android made it the only realistic option. As we didn't want to cheat, and felt it unlikely that MOD procurement would deliver SOTA ultra-fast mobiles dripping in Snapdragon chips, we chose one a few years old. From testing our models we had already ruled out using the GPU, so we thought that as long as we stayed below the RAM threshold we should be OK. So we got started, and this should be easy, right? Mobiles have easy APIs to access location data and audio, and we can run Python on Android… easy.
However, whilst Python does run on Android… PyTorch, which is quickly becoming the backend library for all models, is not supported. Hmm, ok, well there is this thing called PyTorch Mobile, that should work? Nope - not in any way that would make sense as part of a wider system. That left us with the nuclear and semi-nuclear options: try to get a Linux VM running on the mobile, and if that failed, root the device and turn it into a Linux machine (you can guess how little we wanted to do this). Thankfully we found a few apps with varying degrees of emulation. Termux was useful, but sadly it just didn't have the power of a full distribution; then we stumbled across UserLAnd, which enables full(ish) Linux distributions to be installed on the phone.
So we had Python running - success! Not so fast… this was now a VM rather than an app running in Android, meaning accessing location data was not easy, and even passing data between the environments was difficult. Not even the file structures were consistent! So, for something we thought would be so simple, we ended up creating a hugely complex structure to pass messages between a capture app (running the microphone and location services) and the Python code. This took much swearing to achieve. We won't even go into the lack of control you have over app resource allocation - not helpful when RAM is critical to model inference performance…
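For what it's worth, the Python side of that plumbing boiled down to something like the sketch below: the capture app writes audio to storage both environments can reach and ships a small JSON blob (file path plus a location fix) over a local socket. The port, paths and field names are illustrative, and the real thing was considerably messier.

```python
import json
import socket

HOST, PORT = "127.0.0.1", 5055  # illustrative - any free local port works

def handle_capture(audio_path: str, lat: float, lon: float) -> None:
    print(f"new capture {audio_path} at {lat:.4f}, {lon:.4f}")
    # ...transcribe, extract terms, build a CoT message (next section)...

def serve() -> None:
    """Receive one JSON blob per connection from the capture app."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            conn, _addr = srv.accept()
            with conn:
                payload = b""
                while chunk := conn.recv(4096):
                    payload += chunk
                msg = json.loads(payload.decode("utf-8"))
                # e.g. {"audio_path": "/shared/capture_0031.wav",
                #       "lat": 51.5007, "lon": -0.1246}
                handle_capture(msg["audio_path"], msg["lat"], msg["lon"])

if __name__ == "__main__":
    serve()
```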
TAK Integration
The final piece of our puzzle was getting the information to show up on a TAK end-user device. TAK (Team Awareness Kit) is an open-source framework with a common message schema and plugins built around geographic information and displays. It is a great concept, but it can be an absolute a@#& to work with at times. There are multiple versions of TAK covering civilian and military use: CivTAK runs somewhat behind the military version, which receives significant investment from multiple governments, but it is at least openly available. Then there are a plethora of servers, apps for different platforms and API libraries, all offering easy-to-use ways to build a TAK solution. This meant navigating a mass of official and unofficial software, docs and approaches. Let's call our success rate stochastic and leave it at that. Honestly, try and find a good/accurate definition of the full CoT format (Cursor on Target - a common TAK message schema), we dare you. And even when you have figured out how to send a message, different application versions or systems behave differently when you send it… by this point we were halfway under the desk, rocking slowly back and forth.
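To make the CoT point a little more concrete, here is a minimal sketch of building and sending a hostile-contact event. The event type, callsign, server address and port are assumptions (an unencrypted TCP input is shown purely for illustration - anything real should be using TLS), and this covers only a small subset of the schema:

```python
import socket
import uuid
from datetime import datetime, timedelta, timezone

def cot_timestamp(dt: datetime) -> str:
    """CoT timestamps are ISO 8601 in UTC with a trailing 'Z'."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

def build_cot(lat: float, lon: float, callsign: str, stale_s: int = 300) -> bytes:
    """Build a minimal CoT event for a hostile ground track.

    'a-h-G' (atoms / hostile / Ground) is a generic hostile ground type -
    the exact subtype you want depends on how it should be symbolised.
    """
    now = datetime.now(timezone.utc)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<event version="2.0" uid="{uuid.uuid4()}" type="a-h-G" how="h-e" '
        f'time="{cot_timestamp(now)}" start="{cot_timestamp(now)}" '
        f'stale="{cot_timestamp(now + timedelta(seconds=stale_s))}">'
        f'<point lat="{lat}" lon="{lon}" hae="0.0" ce="50.0" le="50.0"/>'
        f'<detail><contact callsign="{callsign}"/></detail>'
        '</event>'
    )
    return xml.encode("utf-8")

def send_cot(event: bytes, host: str = "192.168.1.10", port: int = 8087) -> None:
    """Push one event to a TAK server's unencrypted TCP input
    (illustrative address/port - check your server config)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(event)

send_cot(build_cot(lat=51.5007, lon=-0.1246, callsign="ENEMY TANK"))
```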
It works!
But just as we were about to give up, a little blue circle of hope appeared on the screen, and shouts of exultation echoed around our virtual office. After many fights we had managed to get the end-to-end process to run: we could send a message about an enemy tank 500m away, and both the sender and the target would pop up on the map in front of us. This could have been anticlimactic, but I assure you that by this point it wasn't.
How can this be exploited?
We fought a lot to get this to work, and we knew it wouldn't be easy, but we thought the value was worth the headaches. We showed one example (the integration with TAK) as proof of how useful this could be, but we think it could have huge implications for human-machine teaming, interacting with autonomous systems, real-time analytics, predictive analytics… the list goes on. By creating data at the source, the Army can act faster and make quicker decisions whilst doing exactly what they were going to do anyway! So hopefully just proving it's possible will provide some utility to end users at some point!