# 2023-09-02 Comparing JSON documents JSON became de facto standard for structured documents in modern days. While it may not be used everywhere, most of times we're able to convert a document into JSON and use a vast collection of tools to manipulate data this way. ## What about YAML? I'll focus only on working with JSON, but YAML format is usually preferred to represent JSON compatible schemas in more human readable way. Unless YAML document uses some user-defined types, we can convert it to JSON e.g. using shell function and Python module `ruamel.yaml`: ``` yaml2json () { python3 -c 'from ruamel import yaml; import json, sys; json.dump(yaml.safe_load(sys.stdin), sys.stdout)'; } ``` ## Simple delta That's the easiest case. Let's say that our first document looks like this: ```json { "object1": { "key1": "value1", "key2": "value2" } } ``` and our second object looks like this: ```json { "object1": { "key2": "value2", "key1": "value1" } } ``` When we try to create a delta normal way we get some misleading information: ``` $ diff d1.json d2.json 3,4c3,4 < "key1": "value1", < "key2": "value2" --- > "key2": "value2", > "key1": "value1" ``` It's true that these documents differ in key ordering, but that's not what we care about in structured documents. Fortunately `jq` helps here: ``` $ diff <(jq --sort-keys . d1.json) <(jq --sort-keys . d2.json) $ ``` Now we're comparing actual content, not the formatting of documents. ## Documents with changes in objects and arrays Simple delta will work great for documents where only simple values differ. Delta becomes less readable the more complex structures we're comparing. Let's say our first document is now: ```json { "object1": { "key1": "value1", "key2": "value2" }, "object3": { "key1": [ "value1", "value2", [ { "key1": "value1" }, { "key1": "value1" } ] ] } } ``` and our second document is: ```json { "object2": { "key1": "value1", "key2": "value2" }, "object3": { "key3": [ "value3", "value5", [ { "key1": "value2" } ] ] }, "array1": [ "value1" ] } ``` Using `diff` with `jq` still works, but now delta also contains lines which doesn't introduce any value and are only required to ensure document validity like curly opening and closing braces: ``` $ diff <(jq --sort-keys . d1.json) <(jq --sort-keys . d2.json) 2c2,5 < "object1": { --- > "array1": [ > "value1" > ], > "object2": { 7,9c10,12 < "key1": [ < "value1", < "value2", --- > "key3": [ > "value3", > "value5", 12,15c15 < "key1": "value1" < }, < { < "key1": "value1" --- > "key1": "value2" ``` ### Converting JSON into a "flat" form In order to focus only on actual differences between documents and not care about how they are formatted we will first produce a flat version of JSON where each line corresponds to exactly one value. `jq` allows to write longer scripts and reference them in invocations. We're going to use the following: ``` paths as $path | getpath($path) as $value | ($value | type) as $type | select( ( ($type == "object" and $value != {}) or ($type == "array" and $value != []) ) | not ) | [([$path[] | tostring] | join(".")), $value] ``` It basically traverses all paths in document and outputs their value, unless they are non-empty complex types. Let's have it convert an example document like: ```json { "empty_object": {}, "empty_array": [], "array": [ "value1", "value2" ], "object": { "key1": "value1\nvalue1.1", "key2": "value2" } } ``` We will use `-c` option to `jq` in order to ensure that each key-value pair actually takes one line: ``` $ cat d3.json | jq -cf flat.jq ["empty_object",{}] ["empty_array",[]] ["array.0","value1"] ["array.1","value2"] ["object.key1","value1\nvalue1.1"] ["object.key2","value2"] ``` Now we can use this `jq` script to compare more complex documents: ``` $ diff <(jq -cf flat.jq d1.json | sort) <(jq -cf flat.jq d2.json | sort) 1,6c1,6 < ["object1.key1","value1"] < ["object1.key2","value2"] < ["object3.key1.0","value1"] < ["object3.key1.1","value2"] < ["object3.key1.2.0.key1","value1"] < ["object3.key1.2.1.key1","value1"] --- > ["array1.0","value1"] > ["object2.key1","value1"] > ["object2.key2","value2"] > ["object3.key3.0","value3"] > ["object3.key3.1","value5"] > ["object3.key3.2.0.key1","value2"]>> ``` ## Showing only the common parts of documents Let's say we have document: ```json { "object1": { "key1": "value1", "key2": "value2.1" }, "array1": [ "value1", "value2", "value3", "value4" ] } ``` and: ```json { "object1": { "key1": "value1", "key2": "value2" }, "array1": [ "value1", "value2", "value4" ] } ``` `comm` tool will help us here: ``` $ comm -1 -2 <(jq -cf flat.jq d1.json | sort) <(jq -cf flat.jq d2.json | sort) ["array1.0","value1"] ["array1.1","value2"] ["object1.key1","value1"] ``` One way to use this technique is, if we have default values files and overrides files and want to see which keys are redundant in the latter.