Somewhat related question. Assume I have ~1k ~200MB XML files that get ~20% of t...

adobrawy · on Oct 25, 2024

I don't know what your "best" criterion is (implementation costs, implementation time, maintainability, performance, compression ratio, etc.). Still, the easiest way to start is to delegate it to the file system, so zfs + compression. Access time should be decent. No application-level changes are required to enable that.

betaby · on Oct 25, 2024

It is already on ZFS with compression.

a_e_k · on Oct 26, 2024

Crazy thought: something like BorgBackup.

If only 20% of the content gets changed, the rolling hash that Borg does to chunk files could identify the 80% common parts and then with its deduplication it would store just a single compressed copy of those chunks. And as a bonus, it's designed for handling historical data.

nomel · on Oct 25, 2024

If you can share, but I'd be curious to know what that large of an XML file might be used for, and what benefits it might have over other formats. My persona and professional use of XML has been pretty limited, but XSD was super powerful, and the reason we choose it when we did.

betaby · on Oct 25, 2024

Juniper routers configs, something like below.

adamc@router> show arp | display xml <rpc-reply xmlns:JUNOS="http://xml.juniper.net/JUNOS/15.1F6/JUNOS"> <arp-table-information xmlns="http://xml.juniper.net/JUNOS/15.1F6/JUNOS-arp" JUNOS:style="normal"> <arp-table-entry> <mac-address>0a:00:27:00:00:00</mac-address> <ip-address>10.0.201.1</ip-address> <hostname>adamc-mac</hostname> <interface-name>em0.0</interface-name> <arp-table-entry-flags> <none/> </arp-table-entry-flags> </arp-table-entry> </arp-table-information> <cli> <banner></banner> </cli> </rpc-reply>

hobs · on Oct 25, 2024

it's a good question because my answer for a system like this which had very little schema changing was just dump it into a database and add historical tracking per object that way, hash, compare, insert and add historical record.

betaby · on Oct 25, 2024

I do have the current state in the DB. However I need sometimes to compare today's file with the one from 6 month ago.

hobs · on Oct 25, 2024

So I assumed something like - you have the same schema with the same tabular format inside or the XML document, and that those state changes are in a way so you can tell the timestamp - then you can bring up both states at the same time and compare across the attributes for wrongness.

EXCEPT/INTERSECT make this easy for a bunch of columns (excluding the times of course, I usually hash these for performance reasons) but wont tell you what itself is the difference, you have to do column by column comparisons here, which is where I usually shell out to my language of choice because SQL sucks at doing that.

o11c · on Oct 25, 2024

I'm not sure if this quite fits your workload, but a lot of times people use `git` when `casync` would be more appropriate.

hokkos · on Oct 25, 2024

You can compress in EXI, it's a format for XML and if it is informed by the schema can give a big boost in compression.

tln · on Oct 25, 2024

> get ~20% of their content changed

...daily? monthly? how many versions do you have to keep around?

I'd look at a simple zstd dictionary based scheme, first. Put your history/metadata into a database. Put the XML data into file system/S3/BackBlaze/B2, zstd compressed against a dictionary.

Create the dictionary : zstd --train PathToTrainingSet/* -o dictionaryName Compress with the dictionary: zstd FILE -D dictionaryName Decompress with the dictionary: zstd --decompress FILE.zst -D dictionaryName

Although you say you're fine with it being not that storage efficient to a degree, I think if you were OK with storing every version of every XML file, uncompressed, you wouldn't have to ask right?

betaby · on Oct 25, 2024

If one stores a whole versions of the files that defeats the idea of git, and would consume too much space. I suppose I don't even need zstd if I have ZFS with compression, although compression levels won't be as good.

tln · on Oct 25, 2024

You're relying on compression either way... my hunch is that controlling the compression yourself may get you a better result.

Git does not store diffs, it stores every version. These get compressed into packfiles https://git-scm.com/book/en/v2/Git-Internals-Packfiles. It looks like it uses zlib.