🔗 Parsing PDFs (and more) in Elixir using Rust


Here's the thing about PDFs: they're complex beasts that take quite a bit of thought to parse properly. They come in all shapes and sizes, and they can contain many different types of data and formatting. 90% of the time, we just want to extract the text from the file, but even that isn't always easy. As for the remaining 10%, well, we won't be covering that in this blog post.

If you've been in the Elixir world for long enough, you'll probably have tried to parse a PDF file and realised that it's not as easy as it seems. A quick look at the Elixir Forum will show you that there is no simple way to do it.

Most people will tell you to upload the file to S3 and use a Lambda to handle the contents. Offloading to AWS Lambda might seem elegant at first ("Look, Ma, no dependencies!"), but it comes with its own baggage:

You're adding network latency to what should be a simple operation

AWS costs can spiral if you're processing lots of PDFs

You're now dependent on external services for core functionality

Debugging becomes a distributed systems problem

These aren't ideal solutions. Software engineering is already more complicated than it needs to be at times, and we don't need to add more complexity to the mix.

We need a robust, native solution that plays nicely with the BEAM. So how do we do that?

continue reading on www.chriis.dev

⚠️ This post links to an external website. ⚠️

If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can email me. To get new posts, subscribe using the RSS feed.