October 11, 2025 | 21:00
Reading-Time: ca. 6 Min

Automatically Reading PDF Forms

The Portable Document Format (PDF)¹ is a great example of how an originally brilliant concept for displaying print documents has been ruined over decades. Initially conceived as a successor to PostScript,² it has degenerated into a universal container format. Text, images, vectors, scripts, fonts, form data, even complete 3D models: there’s hardly anything that can’t end up in a PDF, including Doom.³

The PDF Pseudo-Standard

In practice this means: Hardly any two PDFs are alike. A single piece of information can be packaged in a dozen different ways, depending on which program was used to create it.⁴ And we haven’t even gotten to the topic of signatures yet.⁵

At least there are ISO standards for long-term archiving, such as PDF/A.⁶ That provides some structure and serves as a reference for all systems and viewers, so nobody is forced to use products from Adobe, the company behind the infamous Acrobat Reader.⁷

It may sound confusing: PDF is the worst possible exchange format and yet at the same time it’s the best, because it’s universal and can be used anywhere.

PDF Forms - Do it Right!

People keep sending me PDFs I’m supposed to fill out. However, only a few of them are actually fillable within a browser or PDF viewer. Am I really supposed to print them out, fill them in by hand, scan them and send them back? Or sketch something freely into the PDF with my mouse? Situations like that make me want to scream:

Create fillable PDF forms, not PDFs that just look like forms!

One possible explanation: Many of these “professional” documents come from Microsoft Office, which in the year 2025 still cannot create fillable PDF forms.⁸

If you want to use PDF forms in your processes, please use LibreOffice instead.⁹ I don’t recommend LibreOffice because it’s particularly great, but because it’s an available and, most importantly, working solution. It does have a few pitfalls you should know about.

The PDF forms created in this way can be filled out and returned on all common systems, viewers, and browsers. Without any proprietary subscription-based software including vendor lock-in.¹⁰

Creating a PDF form in LibreOffice, with the finished PDF shown in a viewer on the right

What’s Next?

How to create PDF forms in LibreOffice is not the topic of this post. I want to focus on what happens afterwards, when the filled-out and returned forms start piling up in a folder. This is where the real magic starts, and where the whole tragedy unfolds as well.

Instead of automatically processing the already structured form data, the following happens: PDFs end up in nested (SMB) folder structures or, worse, in Outlook/Exchange mailboxes. Lost forever. If information is transferred at all, it is transferred manually.

That’s probably the real reason why many office workers have two displays: On the left a PDF with its contents. On the right the ERP system, where the information is typed in manually. Or the nicely structured PDF is dragged and dropped from left to right as an unstructured BLOB.¹¹ Classic style with a paperclip icon. Digitalization straight from Absurdistan.

Extracting Information from PDF Forms

Below I’ll describe my own process for automatically processing data from PDF forms using a bit of Bash and free software. As always, without claiming completeness or universality: Your mileage may vary.

I try to keep the dependency list for all my technical stacks as short and simple as possible. Here it consists of pdfcpu,¹² jq,¹³ and curl.¹⁴ While pdfcpu has to be installed manually from its Git repository, the other two are included in the standard repositories of most GNU/Linux distributions.
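A script that relies on these three tools can check for them up front instead of failing halfway through a folder. The following is a minimal sketch; the function name check_deps is my own illustration and not part of any of the tools.

```shell
#!/usr/bin/env bash
# Sketch: report every missing command-line dependency.
# check_deps is a hypothetical helper name chosen for this example.

check_deps() {
  local missing=0 cmd
  for cmd in "$@"; do
    # command -v succeeds only if the shell can resolve the command
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing dependency: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (at the top of a processing script):
#   check_deps pdfcpu jq curl || exit 1
```

Because the function lists all missing tools before returning, a fresh machine needs only one run to show what still has to be installed.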

Left: the view of the filled PDF form. Right: the data pdfcpu extracts from the form fields

The compact Go binary pdfcpu extracts PDF form data into a structured JSON file.¹⁵ This intermediate format is then processed further with jq. Two common scenarios:
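Both scenarios start from the same export step. A minimal sketch of it, assuming the returned PDFs sit in one folder and the JSON files should land in another. The folder names and the function name export_forms are placeholders for illustration; the call itself uses pdfcpu's "form export" command with an input PDF and an output JSON file.

```shell
#!/usr/bin/env bash
# Sketch: export the form data of every PDF in a folder to JSON.
# export_forms and the folder names in the usage line are hypothetical.

export_forms() {
  local in_dir="$1" out_dir="$2" pdf
  mkdir -p "$out_dir"
  for pdf in "$in_dir"/*.pdf; do
    [ -e "$pdf" ] || continue   # glob matched nothing: skip
    # pdfcpu form export <inFile> <outFileJSON>
    pdfcpu form export "$pdf" "$out_dir/$(basename "$pdf" .pdf).json" \
      || echo "export failed: $pdf" >&2
  done
}

# Usage:
#   export_forms ./inbox ./json
```

A failed export is only logged here, so one broken PDF does not stop the rest of the batch. Whether that is the right behavior depends on the process behind it.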

Exporting to a CSV File

The following snippet shows how to generate a CSV file for further processing from a set of .json files in a folder. The form field names correspond to those in my previously created PDF. For simplicity, the yes/no option fields are represented by numbers.

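OUT_FILE="./exportfile.csv"

# Append one CSV row per exported form
for f in *.json; do
  [ -e "$f" ] || continue   # no .json files present: skip the unexpanded glob
  jq -r '
    # Use the first form of the export, or an empty object if there is none
    (.forms[0] // {}) as $f
    # Merge all field types into one list
    | (($f.textfield // []) + ($f.datefield // []) + ($f.radiobuttongroup // []))
    # Turn the list into a lookup from field name to value
    | map({key:.name, value:(.value // "")}) | from_entries
    # Pick the columns in a fixed order
    | [.Datum, .Firma, .Funktion, .Unterschrift, .Email, .Rufnummer,
       .["1"], .["2"], .["3"], .["4"], .["5"], .["6"], .["7"]]
    | @csv
  ' "$f" >> "$OUT_FILE"
done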
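The loop above appends data rows only, so the resulting file has no column names. If the receiving system expects a header row, it can be written once before the loop. A minimal sketch, assuming the column order of the jq filter above; write_csv_header is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: write a CSV header row, but only if the file is new or empty.
# write_csv_header is a hypothetical helper; the column names mirror the
# form field names used in the jq filter.

write_csv_header() {
  local out_file="$1"
  # -s is true when the file exists and is not empty
  if [ ! -s "$out_file" ]; then
    printf '%s\n' \
      '"Datum","Firma","Funktion","Unterschrift","Email","Rufnummer","1","2","3","4","5","6","7"' \
      > "$out_file"
  fi
}

# Usage (before the for loop):
#   write_csv_header "$OUT_FILE"
```

Checking for an empty file keeps repeated runs from scattering header lines between the data rows. It does not prevent the same form from being appended twice, which needs separate bookkeeping.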

Often CSV files are still the only means of data exchange between systems, mostly in combination with another absurdity: storing them on SMB network shares across intentionally separated VLAN security zones. A nightmare for operations and information security alike.

Interaction of pdfcpu and jq using a PDF form

Sending to a REST API

A much more modern approach is, of course, data exchange via a REST API.¹⁶ For older applications that don’t have one, you can take a peek at my REST API skeleton and adapt it. That at least helps to ease some of the operational pain around SMB shares and security.¹⁷

The following snippet shows how to use jq to retrieve the collected contents of many PDF forms in a directory and send them to a REST API using curl:

API_URL="https://any-rest.api"
API_TOKEN="**SECURITY-TOKEN***"

for f in *.json; do
  PAYLOAD="$(jq -c '
    (.forms[0] // {}) as $f
    | (($f.textfield // []) + ($f.datefield // []) + ($f.radiobuttongroup // []))
    | map({key:.name, value:(.value // "")}) | from_entries
    | {
        datum:        .Datum,
        firma:        .Firma,
        funktion:     .Funktion,
        unterschrift: .Unterschrift,
        email:        .Email,
        rufnummer:    .Rufnummer,
        antworten: {
          "1": .["1"], "2": .["2"], "3": .["3"], "4": .["4"],
          "5": .["5"], "6": .["6"], "7": .["7"]
        },
      }
  ' "$f")"

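  # Keep the response body in a temp file and capture only the status code.
  # Note: the curl write-out variable is lowercase, %{http_code}.
  RESPONSE_FILE="$(mktemp /tmp/XXXXXX.json)"
  HTTP_CODE="$(
    curl -sS -o "$RESPONSE_FILE" -w '%{http_code}' \
      -X POST "$API_URL" \
      -H "Authorization: Bearer $API_TOKEN" \
      -H "Content-Type: application/json" \
      --data-raw "$PAYLOAD"
  )"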
done
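The loop stores the status code in HTTP_CODE but does not act on it yet. One way to evaluate it is sketched below, under the assumption that any 2xx response means the form was accepted and the JSON file can be moved out of the way. The function name handle_result and the ./done folder are placeholders, not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch: decide what happens to a JSON file after the POST.
# handle_result and the target folder are hypothetical names.

handle_result() {
  local http_code="$1" json_file="$2" done_dir="$3"
  case "$http_code" in
    2??)
      # Accepted: move the file so the next run does not send it again
      mkdir -p "$done_dir"
      mv -- "$json_file" "$done_dir"/
      ;;
    *)
      # Anything else, including 000 when curl got no HTTP response
      echo "upload failed for $json_file (HTTP $http_code)" >&2
      return 1
      ;;
  esac
}

# Usage (inside the loop, after the curl call):
#   handle_result "$HTTP_CODE" "$f" ./done || true
```

Failed files stay where they are, so a later run picks them up again automatically.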

Conclusion

With this post I wanted to show what processing filled-out PDF forms can look like. The snippets are intentionally just rough sketches, but they point in the right direction. Real-world production use requires a few more details, details that Copilot and ChatGPT of course don’t provide.

I’m happy to offer my expertise wherever digitalization should be sustainable, automated, and - above all - independent of proprietary and costly solutions.

Or, to put it more provocatively: Anyone who uses PDF forms without automating their further processing is actively preventing digitalization.

With that said,
Yours,
Tomas Jakobs

