Loading Dirty JSON With Python

Recently I needed to parse some data embedded in HTML. At first glance it appeared to be JSON, so after pulling the text out of the HTML using BeautifulSoup, I tried to load it using the json module, however this immediately threw an error:

ValueError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

This is because,  despite first appearances, the data I was trying  to extract was a python object built from strings, lists, integers, floats, and dictionaries which had been passed to the ‘print’ statement. But it was quite close to JSON so I decided that the best course of action in this instance was to ‘fix’ the data so that I could load it as JSON.

First, as the error above indicates, double quotes are required, not the single quotes mostly (but not always prefixed with a ‘u’  (indicating unicode) which my data had.

After removing these I encountered the error:

ValueError: No JSON object could be decoded

This thoroughly unhelpful error sent me scurrying to Google. Apparently this error is thrown in a variety of situations, but the one relevant to my data was the case of the boolean key words (True and False) in python they are capitalised, but in JSON they need to be lowercase. (This error is also thrown when there are trailing commas in lists).

I used regular expression substitution to implement these alterations. I decided to share these few lines of code for my future self and anyone else who may find it useful. (Note that this worked for my use case, but as soon as exceptions stopped being thrown I moved on. Therefore it may not be a robust or complete solution. You have been warned.)

import re
import json

def load_dirty_json(dirty_json):
    regex_replace = [(r"([ \{,:\[])(u)?'([^']+)'", r'\1"\3"'), (r" False([, \}\]])", r' false\1'), (r" True([, \}\]])", r' true\1')]
    for r, s in regex_replace:
        dirty_json = re.sub(r, s, dirty_json)
    clean_json = json.loads(dirty_json)
    return clean_json

Leave a Reply

Your email address will not be published. Required fields are marked *