Three things we've learned from parsing 2800 credit reports

Marcel Pauly, his colleagues from SPIEGEL Data and BR Data evaluated reports from Schufa, the most influential credit agency in Germany. The documents were provided by 2800 volunteers in a crowdsourcing project. Some documents came scanned, some were photographed. It was an OCR nightmare.

In February 2018, the organisations Open Knowledge Foundation Germany (OKFN) and AlgorithmWatch launched their crowdsourcing project OpenSCHUFA. Their goal was to "crack" Schufa, the most influential credit agency in Germany, and to learn more about its scoring algorithm. The participants requested their free personal credit report and provided scans or pictures of it to the project. We at SPIEGEL Data had the opportunity to work with the documents together with our colleagues from BR Data.

1. Commit, pull, push

I really hate Git. It's the most unintuitive thing in the world. But this project wouldn't have been possible without a version control system: We were roughly four people working on the document parser. The BR colleagues are located in Munich and we are based in Hamburg.

We used R and Python to build the parser's different parts. Our days were a constant commit, pull, push, commit, pull, push. For security reasons, we only pushed the scripts themselves into the Git system. The documents stayed on our local machines and aggregated datasets were only created and saved locally by the parser.

2. Check your OCR settings

I really like Abbyy FineReader. I've used it every now and then over the last years to extract text and spreadsheets from pictures like the ones a non-searchable PDF consists of. This is a process known as OCR – optical character recognition. So I was very happy when OKFN and AlgorithmWatch decided to use FineReader for the uploaded credit reports.

I had never thought about the different OCR settings FineReader offers. Its presets just worked fine for me – until this time. In Schufa's credit report, you have information like dates (when Schufa transmitted a person's score to a company) and percentages (the score itself says how likely it is that a person will meet his or her payment obligations).

FineReader exports documents as Excel files. When we imported these files into our R and Python scripts, we discovered that there were some weird-looking values in the date column like "43183". And in the score column, we sometimes had values in a string format like "97.12%" and sometimes in a float format like "0.9712".

So, in some cases, FineReader must have recognized dates and percentages and converted them into the Excel format: dates are saved as days since January 1st 1900 (it's sad because it's true) and percentages are represented as decimal numbers between 0 and 1.

To avoid that, you have to uncheck the box "Convert numeric values to numbers". This way, the actual values are preserved as strings.

3. Fuzzy matching and regex

From both the tables of transmitted scores (screenshot above) and the financial histories (screenshot below), we wanted to extract specific information.

In a person's financial history, you can find banks reporting the creation of a new account or loan, phone companies reporting a contract, online retailers requesting credit information. For each entry in a person's financial history we wanted to know: What kind of entry is it (bank account, loan, phone contract, credit information request, …)? When was it reported? And by which company?

To identify what kind of entry a paragraph was, we used fuzzy matching with Levenshtein distance. It's a way to check if a certain phrase appears in a text. And you're able to specify how exact the match must be. In OCRed documents you'll find misspellings like "Girakonto" instead of "Girokonto", where the OCR software confused a letter for another. With fuzzy matching you can handle such misspellings.

To extract an entry's date and company, we used regex, short for regular expressions.

This is a type of language widely used to find patterns in text. For example, "." is used as a placeholder for any character, "+" stands for "more than once".

In our documents we had patterns like "On [DATE] did [COMPANY] note that, …" or "[COMPANY] noted that on [DATE] …". Extracting the dates was pretty easy. We used the regex query "(\d{2}[.,]\d{2}[.,]\d{4})". It can be "translated" as: two digits, followed by either a period or a comma (for OCR errors), then two more digits, again a period or a comma and finally four digits.

To extract the companies' names, we had to prepare different regex queries for the different types of entries. Sometimes the name is in a paragraph's beginning, sometimes it comes later. So we used queries like "did (.+) note" and "^(.+) noted that on". The text snippet before and after the companies' names had to be long enough to ensure that no false positives were caught. And they had to be as short as possible to avoid unnecessary OCR errors.

If anyone knows a good Python library that combines fuzzy matching and regex, please let me know.

You can find the full project by SPIEGEL Data 👉 here and the BR Data version 👉 here.


Marcel Pauly

Marcel Pauly leads the data journalism unit of SPIEGEL ONLINE. He graduated from Henri Nannen Journalism School and Columbia's Lede Program.

Runs on:

How many stickers do you have on your laptop?
Only the one that covers the webcam.

How many pie charts have you built?
Mmmmmmhhhhhh pie.

How many times per week do you have to explain what "data journalism" is?
It's getting rarer and rarer.

Swear words per day?

Your funniest file name?
Why_do_Pandas_and_­Matplotlib_­ work_how_­they_do?.ipynb

snow flake
© 2018 Journocode