Renaming PDF files based on their content

My accountant recently moved my business accounts system over to FreeAgent. One thing I like to do is keep a copy of every invoice PDF that I issue in a folder on my computer as a back up, just in case the online systems let me down.

As I was learning new systems, I took advantage of this time to write a script to rename the invoice PDFs from Nineteen-Feet-Limited_Invoice_123.pdf to my preferred format of 00123 19FT {company name} {date of invoice}.pdf

To do this, I wrote a simple bash script that uses pdftotext along with the standard grep, tail, sed, etc., with the heavy lifting, of course, being done by pdftotext.

pdftotext

pdftotext extracts the text from a PDF file. If you pass it the -raw option, it makes no attempt to reproduce the page layout with whitespace and instead emits the text in content-stream order, which can sometimes make the output more predictable for extracting data.

The command is: pdftotext -raw {filename.pdf} {filename.txt}. You can use - in place of {filename.txt} to send the output to stdout, which is useful for piping to other tools.
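As a minimal sketch of the stdout form, the helper below wraps the call with two basic checks. The function name pdf_text is hypothetical and not part of the original script; it assumes pdftotext is installed (it is distributed with Poppler and Xpdf).

```shell
# Hypothetical helper: print the raw text of a PDF on stdout.
pdf_text() {
    # Fail early if the pdftotext binary is not on the PATH.
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found" >&2
        return 127
    fi
    # Fail if the given file does not exist.
    if [ ! -f "$1" ]; then
        echo "No such file: $1" >&2
        return 1
    fi
    # The trailing - sends the extracted text to stdout.
    pdftotext -raw "$1" -
}
```

Something like text=$(pdf_text "$in_filename") would then let the text be extracted once and reused for each field, rather than running pdftotext several times.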

Here’s the output for an example invoice PDF:

$ pdftotext -raw Nineteen-Feet-Limited_Invoice_123.pdf  -
1/1
If you have any questions concerning this invoice, please contact Rob Allen
Starling
Bank/Sort Code:
123456
Account Number:
12354678
Payment Reference:
123
Company Registration
Number:
08448563
Invoice 123 ? 30 May 2024 ? Payment due by 30 June 2024
Invoice 123 ? 30 May 2024 ? Payment due by 30 June 2024
Quantity Details Unit Price (?) VAT Net Subtotal (?)
1 Thing 10.00 20% 10.00
Net Total 10.00
VAT 2.00
GBP Total ?12.00
Nineteen Feet Limited
2 Copenhagen Street
Worcester
WR1 2HB
VAT: 159779931
Customer Ltd.
Street
Town
Post code

One interesting thing that I’ve noticed is that a line sometimes repeats itself in raw mode, which doesn’t happen in normal mode. The order of the text is also completely different, which emphasises that the PDF format is not linear: the position of text in the file is not related to its location when rendered.

Extracting data

I need to extract the company’s name, the invoice number and the date. This is done by piping the output of pdftotext to various utilities.

Company name

Firstly, I need to find something to anchor my search to. For the company name, you can see that it’s on the line directly after the VAT number, so we can search for that with grep:

company=$(pdftotext -raw "$in_filename" - | grep -A 1 '^VAT:' | tail -n 1)

We use grep to find the line starting with “VAT:” and with the -A 1 switch, grep will return the line with VAT: on it along with the following line. Piping that to tail -n 1 will result in just the last line, which is the company name we require.
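To see the anchor-and-follow pattern in isolation, here is a small sketch that runs the same pipeline over a few lines copied from the sample output above, so it does not need a PDF. The sample_text variable is only there for illustration.

```shell
# A few lines from the sample invoice text, standing in for pdftotext output.
sample_text='WR1 2HB
VAT: 159779931
Customer Ltd.
Street'

# grep -A 1 prints the matching line plus one line after it;
# tail -n 1 keeps only that following line.
company=$(printf '%s\n' "$sample_text" | grep -A 1 '^VAT:' | tail -n 1)
echo "$company"
```

This relies on the company name always being the line immediately after the VAT number, which holds for the invoice shown here but is an assumption about the template.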

Invoice number

We follow the same pattern for the invoice number, by anchoring from the “Payment Reference:” text:

invoice_number=$(pdftotext -raw "$in_filename" - | grep -A 1 '^Payment Reference:' | tail -n 1)

We then pad it with leading zeros to make it 5 characters long using printf:

invoice_number=$(printf "%05d\n" "$invoice_number")
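If the extraction ever returns something other than digits, printf will complain, so a guard may be worthwhile. The following is a sketch only; the check and the error message are my additions rather than part of the original script.

```shell
invoice_number="123"   # value extracted from the "Payment Reference:" line

case "$invoice_number" in
    ''|*[!0-9]*)
        # Empty, or contains something other than the digits 0-9.
        echo "Unexpected invoice number: '$invoice_number'" >&2
        ;;
    *)
        # Pad with leading zeros to a width of 5.
        invoice_number=$(printf "%05d" "$invoice_number")
        ;;
esac
echo "$invoice_number"
```

One caveat: printf treats a numeric argument that already has a leading zero as octal, so this assumes the payment reference is printed without leading zeros, as it is in the sample.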

Date of invoice

For the date of the invoice, we need to do more work. Firstly we need to find the date string, which is in this line:

Invoice 123 ? 30 May 2024 ? Payment due by 30 June 2024

You’ll note the ? characters. These stand in for bullet characters that aren’t available in this encoding. To extract the date, I used sed:

date_string=$(pdftotext -raw "$in_filename" - | grep '^Invoice' | LC_ALL=C sed -En 's/Invoice [0-9]+ . ([0-9]{1,2} [A-Za-z]+ [0-9]{4}).*/\1/p')

We use grep to find the line starting with “Invoice” and then pipe that to sed, having first set the locale to C to suppress the error about the encoding mismatch. The sed expression matches the whole line, captures the first date in the ([0-9]{1,2} [A-Za-z]+ [0-9]{4}) group and replaces the line with just that group. The -n switch and the trailing p mean that only lines where the substitution succeeded are printed.

In this case, it replaces “Invoice 123 ? 30 May 2024 ? Payment due by 30 June 2024” with “30 May 2024”
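Since raw mode sometimes repeats the “Invoice” line, as in the sample output, the pipeline can produce the date twice. The sketch below runs the same grep and sed over two copies of the line and adds head -n 1 to keep only the first result. The head step is my addition, and the sample uses a plain ? where the real output has a bullet.

```shell
# Two identical lines, mimicking the repeated line in the raw output.
sample_text='Invoice 123 ? 30 May 2024 ? Payment due by 30 June 2024
Invoice 123 ? 30 May 2024 ? Payment due by 30 June 2024'

# Same grep and sed as above, then keep only the first extracted date.
date_string=$(printf '%s\n' "$sample_text" \
    | grep '^Invoice' \
    | LC_ALL=C sed -En 's/Invoice [0-9]+ . ([0-9]{1,2} [A-Za-z]+ [0-9]{4}).*/\1/p' \
    | head -n 1)
echo "$date_string"
```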

We can now format the date correctly using the date command:

date=$(date -jf "%d %B %Y" "$date_string" '+%Y-%m-%d')

date is one of the more complicated tools out there. The -j prevents it from setting the computer’s date and -f "%d %B %Y" specifies the input format. After the string to be parsed, we specify the output format using the rather unusual + prefix. Why they didn’t use -o or something similar is beyond me! Note that -j and this use of -f belong to the BSD date that ships with macOS; GNU date on Linux parses a date string with -d instead.
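If the script ever needs to run on Linux as well as macOS, one possible approach is to try the BSD form first and fall back to GNU date’s -d. This is a sketch under that assumption, and it expects English month names; the formatted variable name is just for illustration.

```shell
date_string="30 May 2024"

if date -j -f "%d %B %Y" "$date_string" '+%Y-%m-%d' >/dev/null 2>&1; then
    # BSD/macOS date: -j means "don't set the clock", -f gives the input format.
    formatted=$(date -j -f "%d %B %Y" "$date_string" '+%Y-%m-%d')
else
    # GNU date: -d parses a free-form date string.
    formatted=$(date -d "$date_string" '+%Y-%m-%d')
fi
echo "$formatted"
```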

However, we now have the date.

Putting it together

Finally we put it all together to create the output file and rename our PDF:

out_filename="$invoice_number 19FT $company $date.pdf"
mv "$in_filename" "$out_filename"

Note that we put quotation marks around the filenames, as you never know if there’ll be a space in one, which would cause it all to break horribly.
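A slightly more defensive version of the rename step might look like the sketch below. The safe_rename function is hypothetical: it refuses to rename when a field came back empty, replaces any slash in the company name, keeps the file in its original folder and uses mv -n so that an existing file is not overwritten. None of these checks are in the original script.

```shell
# Hypothetical helper: safe_rename <file> <invoice_number> <company> <date>
safe_rename() {
    local in_filename=$1 invoice_number=$2 company=$3 date=$4

    # Refuse to rename if any field failed to extract.
    if [ -z "$invoice_number" ] || [ -z "$company" ] || [ -z "$date" ]; then
        echo "Could not extract all fields from $in_filename" >&2
        return 1
    fi

    # A slash in the company name would be treated as a directory separator.
    company=$(printf '%s' "$company" | tr '/' '-')

    # Keep the renamed file in the same folder as the original.
    local out_filename
    out_filename="$(dirname "$in_filename")/$invoice_number 19FT $company $date.pdf"

    # -n: do not overwrite an existing file; -- ends option parsing.
    mv -n -- "$in_filename" "$out_filename"
}
```

It would be called as safe_rename "$in_filename" "$invoice_number" "$company" "$date" once the three values have been extracted.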

Final thoughts

One of the really nice things about Unix is that there are lots of little utilities that do useful things. Combining them by piping the output from one to another is incredibly powerful.

I enjoy automating little tasks like this to make my life easier. I assign it to a keyboard shortcut via Keyboard Maestro so that I can select the file in Finder and just press control+option+command+i to rename my invoice filename.

This article was posted on 4 June 2024 in Shell Scripting