JSON became de facto standard for structured documents in modern days. While it may not be used everywhere, most of times we're able to convert a document into JSON and use a vast collection of tools to manipulate data this way.
I'll focus only on working with JSON, but YAML format is usually preferred to
represent JSON compatible schemas in more human readable way. Unless YAML
document uses some user-defined types, we can convert it to JSON e.g. using
shell function and Python module ruamel.yaml
:
yaml2json () { python3 -c 'from ruamel import yaml; import json, sys; json.dump(yaml.safe_load(sys.stdin), sys.stdout)'; }
That's the easiest case. Let's say that our first document looks like this:
{
"object1": {
"key1": "value1",
"key2": "value2"
}
}
and our second object looks like this:
{
"object1": {
"key2": "value2",
"key1": "value1"
}
}
When we try to create a delta normal way we get some misleading information:
$ diff d1.json d2.json
3,4c3,4
< "key1": "value1",
< "key2": "value2"
---
> "key2": "value2",
> "key1": "value1"
It's true that these documents differ in key ordering, but that's not what we
care about in structured documents. Fortunately jq
helps here:
$ diff <(jq --sort-keys . d1.json) <(jq --sort-keys . d2.json)
$
Now we're comparing actual content, not the formatting of documents.
Simple delta will work great for documents where only simple values differ. Delta becomes less readable the more complex structures we're comparing. Let's say our first document is now:
{
"object1": {
"key1": "value1",
"key2": "value2"
},
"object3": {
"key1": [
"value1",
"value2",
[
{
"key1": "value1"
},
{
"key1": "value1"
}
]
]
}
}
and our second document is:
{
"object2": {
"key1": "value1",
"key2": "value2"
},
"object3": {
"key3": [
"value3",
"value5",
[
{
"key1": "value2"
}
]
]
},
"array1": [
"value1"
]
}
Using diff
with jq
still works, but now delta also contains lines which
doesn't introduce any value and are only required to ensure document validity
like curly opening and closing braces:
$ diff <(jq --sort-keys . d1.json) <(jq --sort-keys . d2.json)
2c2,5
< "object1": {
---
> "array1": [
> "value1"
> ],
> "object2": {
7,9c10,12
< "key1": [
< "value1",
< "value2",
---
> "key3": [
> "value3",
> "value5",
12,15c15
< "key1": "value1"
< },
< {
< "key1": "value1"
---
> "key1": "value2"
In order to focus only on actual differences between documents and not care
about how they are formatted we will first produce a flat version of JSON where
each line corresponds to exactly one value. jq
allows to write longer scripts
and reference them in invocations. We're going to use the following:
paths as $path
| getpath($path) as $value
| ($value | type) as $type
| select(
(
($type == "object" and $value != {})
or ($type == "array" and $value != [])
)
| not
)
| [([$path[] | tostring] | join(".")), $value]
It basically traverses all paths in document and outputs their value, unless they are non-empty complex types. Let's have it convert an example document like:
{
"empty_object": {},
"empty_array": [],
"array": [
"value1",
"value2"
],
"object": {
"key1": "value1\nvalue1.1",
"key2": "value2"
}
}
We will use -c
option to jq
in order to ensure that each key-value pair
actually takes one line:
$ cat d3.json | jq -cf flat.jq
["empty_object",{}]
["empty_array",[]]
["array.0","value1"]
["array.1","value2"]
["object.key1","value1\nvalue1.1"]
["object.key2","value2"]
Now we can use this jq
script to compare more complex documents:
$ diff <(jq -cf flat.jq d1.json | sort) <(jq -cf flat.jq d2.json | sort)
1,6c1,6
< ["object1.key1","value1"]
< ["object1.key2","value2"]
< ["object3.key1.0","value1"]
< ["object3.key1.1","value2"]
< ["object3.key1.2.0.key1","value1"]
< ["object3.key1.2.1.key1","value1"]
---
> ["array1.0","value1"]
> ["object2.key1","value1"]
> ["object2.key2","value2"]
> ["object3.key3.0","value3"]
> ["object3.key3.1","value5"]
> ["object3.key3.2.0.key1","value2"]>>
Let's say we have document:
{
"object1": {
"key1": "value1",
"key2": "value2.1"
},
"array1": [
"value1",
"value2",
"value3",
"value4"
]
}
and:
{
"object1": {
"key1": "value1",
"key2": "value2"
},
"array1": [
"value1",
"value2",
"value4"
]
}
comm
tool will help us here:
$ comm -1 -2 <(jq -cf flat.jq d1.json | sort) <(jq -cf flat.jq d2.json | sort)
["array1.0","value1"]
["array1.1","value2"]
["object1.key1","value1"]
One way to use this technique is, if we have default values files and overrides files and want to see which keys are redundant in the latter.