Ergodiff

Meaningful diff that augments data

Ergodiff is an algorithm for generating word-level differences.

What is Ergodiff?

In September 2024, I was researching how to extract temporal semantic information from Wikipedia Revision History, aiming to generate rich datasets for downstream models and data analysis. The challenge was to extract important information while keeping the data clean, small, and meaningful. After three weeks of experimenting and debugging, I built a Python library to handle snippet-level diff extraction—and named it Ergodiff. The library also includes a reconstruction algorithm for extending context windows when needed.

Ergodiff extracts meaningful differences between sentences at a word level rather than a character level, preserving semantic context. For instance, consider two sentences:

  • Sentence A: "The quick brown fox jumps over the lazy dog."
  • Sentence B: "A team of quiet brown foxes jump over the lazy dog."

Ideally, a snippet-level diff spec should be able to extract the following information:

  • (0, "The quick", "A team of quiet")
  • (16, "fox jumps", "foxes jump")

The above information means that the snippet "The quick" has been changed to "A team of quiet", and the snippet "fox jumps" has been changed to "foxes jump". The first diff could be explained as a meaning expansion, which could arise for several reasons, such as a situation change (e.g. a new champion, new president, or new product) or a misspelling correction (e.g. "quick" and "quiet" could be misspellings of each other). The second diff could be explained as a word tense change, which is a common phenomenon in natural language.
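To make the tuple format concrete, here is a minimal sketch of how such (offset, old snippet, new snippet) records could be applied to rebuild sentence B from sentence A. The apply_snippet_diffs helper is hypothetical and not part of the Ergodiff API; it assumes the offsets are character positions in the original sentence.

def apply_snippet_diffs(sentence, diffs):
    # diffs: list of (offset, old_snippet, new_snippet) tuples, with offsets
    # pointing at character positions in the original sentence.
    parts = []
    cursor = 0
    for offset, old, new in sorted(diffs):
        parts.append(sentence[cursor:offset])  # unchanged text before the snippet
        parts.append(new)                      # swap in the new snippet
        cursor = offset + len(old)             # skip over the old snippet
    parts.append(sentence[cursor:])            # trailing unchanged text
    return ''.join(parts)

sentence_a = "The quick brown fox jumps over the lazy dog."
diffs = [(0, "The quick", "A team of quiet"), (16, "fox jumps", "foxes jump")]
print(apply_snippet_diffs(sentence_a, diffs))
# A team of quiet brown foxes jump over the lazy dog.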

This example does not cover the actual downstream usage of this data, but it demonstrates a potential direction: using such data to predict the reasoning behind an update. One possible application is to predict the next update of a Wikipedia page based on events that have happened (e.g. a news feed) or on the existing text content. Everything remains unknown at the moment, but it opens up a lot of possibilities for our lab.

In short, Ergodiff improves diff extraction by focusing on ergonomics and data minimalism, making it particularly effective for tasks that need to preserve semantic updates.

Quick Start

To use Ergodiff, import the module and create an instance:

from ergodiff import Ergodiff

ed = Ergodiff()

The instance is stateless, so you can reuse it as much as needed. Use the ed instance to extract differences:

sentence_a = "The quick brown fox jumps over the lazy dog."
sentence_b = "A team of quiet brown foxes jump over the lazy dog."

ed.get_sentence_diff(sentence_a, sentence_b)

The output of this code snippet looks like this:

[(0, 'The quick', 'A team of quiet'), (16, 'fox jumps', 'foxes jump')]
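Because the instance is stateless, a single Ergodiff object can also be reused across many sentence pairs, for example over consecutive revisions of the same sentence (the revision list below is made up for illustration):

from ergodiff import Ergodiff

ed = Ergodiff()

# Hypothetical revision history of one sentence, oldest first.
revisions = [
    "The quick brown fox jumps over the lazy dog.",
    "A team of quiet brown foxes jump over the lazy dog.",
    "A team of quiet brown foxes jumped over the lazy dog.",
]

# Snippet-level diffs between each pair of consecutive revisions.
diffs_per_revision = [
    ed.get_sentence_diff(old, new)
    for old, new in zip(revisions, revisions[1:])
]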

For more details, see the API References.

Alternative Solutions

If you'd prefer not to add another library to your project, there are other options available.

Using the Default difflib

Ergodiff is powered by difflib, with optimizations for storage cost and data ergonomics. However, if you want to explore your own approach using the raw data, you can try difflib directly. It is a powerful built-in Python library.

import difflib

sentence_a = "The quick brown fox jumps over the lazy dog."
sentence_b = "A team of quiet brown foxes jump over the lazy dog."

# Using difflib to find the difference
diff = difflib.ndiff(sentence_a.split(), sentence_b.split())

# Joining the diff into a readable format
diff_result = '\n'.join(diff)

# Printing the result
print(diff_result)

difflib is effective for basic diff operations, but Ergodiff provides a more refined approach tailored for semantic update analysis and dataset construction.

The output of this code snippet looks like this:

- The
- quick
+ A
+ team
+ of
+ quiet
  brown
- fox
+ foxes
?    ++

- jumps
?     -

+ jump
  over
  the
  lazy
  dog.

It is actually a good start, but it still requires a lot of post-processing to extract the differences.
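As a rough illustration of what that post-processing could look like (this is not Ergodiff's actual implementation), one option is to skip parsing the ndiff text and instead merge the word-level opcodes from difflib.SequenceMatcher into snippet pairs:

import difflib

sentence_a = "The quick brown fox jumps over the lazy dog."
sentence_b = "A team of quiet brown foxes jump over the lazy dog."

words_a = sentence_a.split()
words_b = sentence_b.split()

matcher = difflib.SequenceMatcher(None, words_a, words_b)
snippets = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != 'equal':
        # Character offset of the changed snippet in the original sentence
        # (assumes words are separated by single spaces).
        offset = len(' '.join(words_a[:i1])) + (1 if i1 else 0)
        snippets.append((offset, ' '.join(words_a[i1:i2]), ' '.join(words_b[j1:j2])))

print(snippets)
# [(0, 'The quick', 'A team of quiet'), (16, 'fox jumps', 'foxes jump')]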

Using diff-match-patch

diff-match-patch is another popular library for diff operations. It uses Myers' diff algorithm to find the diffs and a semantic cleanup algorithm to make them more human-readable. I learned about it recently from Jeffrey when I was presenting my Blocks Architecture project at an EMNLP 2024 workshop. I quickly found that it is a great library with an amazing algorithm for diff operations, which even covers word-level diffs. It was developed at Google and looks very promising, so I recommend giving it a try if you need a more comprehensive diff solution.

import diff_match_patch as dmp_module

sentence_a = "The quick brown fox jumps over the lazy dog."
sentence_b = "A team of quiet brown foxes jump over the lazy dog."

dmp = dmp_module.diff_match_patch()
diff = dmp.diff_main(sentence_a, sentence_b)  # See result 1.
dmp.diff_cleanupSemantic(diff)  # See result 2.
print(diff)

Result 1 (after normal diff):

[(-1, 'Th'),
 (1, 'A t'),
 (0, 'e'),
 (1, 'am of'),
 (0, ' qui'),
 (-1, 'ck'),
 (1, 'et'),
 (0, ' brown fox'),
 (1, 'es'),
 (0, ' jump'),
 (-1, 's'),
 (0, ' over the lazy dog.')]

Result 2 (after semantic diff):

[(-1, 'The'),
 (1, 'A team of'),
 (0, ' qui'),
 (-1, 'ck'),
 (1, 'et'),
 (0, ' brown fox'),
 (1, 'es'),
 (0, ' jump'),
 (-1, 's'),
 (0, ' over the lazy dog.')]

This result is much better than the default difflib, as it can be used directly without any post-processing. However, how useful it is depends heavily on the specific use case. The pros of diff-match-patch are the reliability and maturity of its algorithms; the cons are its limitations for snippet-level diff extraction. There is no single perfect solution, so we should choose the one that best fits our needs rather than relying on one tool to solve every problem. I will also try this library out in the future to see if it could be a better fit for the downstream tasks in our lab.
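If you do want diff-match-patch to operate on whole words instead of characters, one common adaptation is a word-level variant of its line-mode trick: encode each distinct word as a single placeholder character, diff the encoded strings, and decode the result. The helper below is only an illustrative sketch, not part of the diff-match-patch API, and it rejoins words with single spaces:

import diff_match_patch as dmp_module

def word_level_diff(text_a, text_b):
    # Map each distinct word to a unique placeholder character so that
    # diff_match_patch compares whole words rather than characters.
    word_to_char = {}
    def encode(text):
        return ''.join(
            word_to_char.setdefault(word, chr(0xE000 + len(word_to_char)))
            for word in text.split()
        )
    encoded_a, encoded_b = encode(text_a), encode(text_b)

    dmp = dmp_module.diff_match_patch()
    diffs = dmp.diff_main(encoded_a, encoded_b)

    # Decode the placeholder characters back into space-joined words.
    char_to_word = {char: word for word, char in word_to_char.items()}
    return [(op, ' '.join(char_to_word[c] for c in chunk)) for op, chunk in diffs]

print(word_level_diff(
    "The quick brown fox jumps over the lazy dog.",
    "A team of quiet brown foxes jump over the lazy dog.",
))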

Disclaimer

Ergodiff is an algorithm built on the standard difflib library, and its accuracy is not perfect. We have improved the algorithm to be accurate in most cases, but there is still a chance of hitting edge cases before our stable release (v1).

The reconstruction algorithm has been tested and used internally, but it has not been released yet, as the project has been on hold for a while. I will try to release it, but the future of this project is uncertain at the moment; at least it works for now.

In theory, you can use the experimental reconstruction algorithm like this, but keep in mind that it is not fully accurate yet:

from ergodiff import Ergodiff, progressive_reconstruct

example_a = 'Okay, this is added.'
example_b = 'Okay, that is nicely added.'

ed = Ergodiff()
old_sentences, changes, added_lines = ed.get_diff(example_a, example_b)
new_sentences = progressive_reconstruct(old_sentences, changes)

assert new_sentences == ['Okay, that is nicely added.']