geowatch.utils.util_stringalgo module

Requires: pip install pygtrie

geowatch.utils.util_stringalgo.shortest_unique_prefixes(items, sep=None, allow_simple=True, min_length=0, allow_end=False)[source]

The shortest unique prefix algorithm.

Parameters:
  • items (List[str]) – returned prefixes will be unique wrt this set

  • sep (str) – if specified, all characters between separators are treated as a single symbol. This makes the algorithm much faster.

  • allow_simple (bool) – if True tries to construct a simple feasible solution before resorting to the optimal trie algorithm.

  • min_length (int) – minimum length each prefix can be

  • allow_end (bool) – if True allows for string terminators to be considered in the prefix

Returns:

a prefix for each item that uniquely identifies it with respect to the original items.

Return type:

list of str

References

http://www.geeksforgeeks.org/find-all-shortest-unique-prefixes-to-represent-each-word-in-a-given-list/
https://github.com/Briaares/InterviewBit/blob/master/Level6/Shortest%20Unique%20Prefix.cpp

Requires:

pip install pygtrie

Example

>>> # xdoctest: +REQUIRES(module:pygtrie)
>>> items = ["zebra", "dog", "duck", "dove"]
>>> shortest_unique_prefixes(items)
['z', 'dog', 'du', 'dov']
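The pygtrie-based implementation is the real one; for intuition, the same behavior can be approximated with a pure-Python prefix-counting sketch (the helper name below is hypothetical and not part of this module):

```python
from collections import Counter

def shortest_unique_prefixes_sketch(items):
    # Count how many items share each character-level prefix
    counts = Counter()
    for item in items:
        for i in range(1, len(item) + 1):
            counts[item[:i]] += 1
    # Each item's answer is its shortest prefix seen exactly once
    result = []
    for item in items:
        for i in range(1, len(item) + 1):
            if counts[item[:i]] == 1:
                result.append(item[:i])
                break
        else:
            result.append(item)  # duplicate item: no unique prefix exists
    return result

print(shortest_unique_prefixes_sketch(["zebra", "dog", "duck", "dove"]))
# -> ['z', 'dog', 'du', 'dov']
```

This sketch builds every prefix of every string, so it is quadratic in string length; the trie-based version avoids that blowup.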

Example

>>> # xdoctest: +REQUIRES(module:pygtrie)
>>> from geowatch.utils.util_stringalgo import *  # NOQA
>>> smeti = ["params.metrics.foo.mean", "params.metrics.foo.std", "params.metrics.foo.count"]
>>> items = [p[::-1] for p in smeti]
>>> euqine = shortest_unique_prefixes(items, sep='.', min_length=2)
>>> unique = [p[::-1] for p in euqine]
>>> print(f'unique={unique}')
unique=['foo.mean', 'foo.std', 'foo.count']
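The sep argument switches the granularity from characters to whole tokens between separators. A minimal sketch of that token-level behavior (again using a hypothetical helper, not the module's implementation):

```python
from collections import Counter

def shortest_unique_prefixes_sep_sketch(items, sep='.'):
    # With sep given, each token between separators acts as one symbol
    token_lists = [item.split(sep) for item in items]
    counts = Counter()
    for tokens in token_lists:
        for i in range(1, len(tokens) + 1):
            counts[tuple(tokens[:i])] += 1
    # Each item's answer is its shortest token-prefix seen exactly once
    result = []
    for tokens in token_lists:
        for i in range(1, len(tokens) + 1):
            if counts[tuple(tokens[:i])] == 1:
                result.append(sep.join(tokens[:i]))
                break
        else:
            result.append(sep.join(tokens))
    return result

print(shortest_unique_prefixes_sep_sketch(['a.b.c', 'a.b.d', 'a.x.y']))
# -> ['a.b.c', 'a.b.d', 'a.x']
```

Note that 'a.x.y' shortens to the token prefix 'a.x', never to a partial token, which is why sep also makes the search space (and runtime) much smaller.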
Timing:
>>> # DISABLE_DOCTEST
>>> # make numbers larger to stress test
>>> # L = max length of a string, N = number of strings,
>>> # C = smallest guaranteed common length
>>> # (the setting N=10000, L=100, C=20 is feasible, so we are good)
>>> import timerit
>>> import random
>>> def make_data(N, L, C):
>>>     rng = random.Random(0)
>>>     return [''.join(['a' if i < C else chr(rng.randint(97, 122))
>>>                      for i in range(L)]) for _ in range(N)]
>>> items = make_data(N=1000, L=10, C=0)
>>> timerit.Timerit(3).call(shortest_unique_prefixes, items).print()
Timed for: 3 loops, best of 3
    time per loop: best=24.54 ms, mean=24.54 ± 0.0 ms
>>> items = make_data(N=1000, L=100, C=0)
>>> timerit.Timerit(3).call(shortest_unique_prefixes, items).print()
Timed for: 3 loops, best of 3
    time per loop: best=155.4 ms, mean=155.4 ± 0.0 ms
>>> items = make_data(N=1000, L=100, C=70)
>>> timerit.Timerit(3).call(shortest_unique_prefixes, items).print()
Timed for: 3 loops, best of 3
    time per loop: best=232.8 ms, mean=232.8 ± 0.0 ms
>>> items = make_data(N=10000, L=250, C=20)
>>> timerit.Timerit(3).call(shortest_unique_prefixes, items).print()
Timed for: 3 loops, best of 3
    time per loop: best=4.063 s, mean=4.063 ± 0.0 s
geowatch.utils.util_stringalgo.shortest_unique_suffixes(items, sep=None, min_length=0)[source]

Example

>>> # xdoctest: +REQUIRES(module:pygtrie)
>>> items = ["zebra", "dog", "duck", "dove"]
>>> shortest_unique_suffixes(items)
['a', 'g', 'k', 'e']
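A unique suffix is just a unique prefix of the reversed string, so the suffix variant can be sketched with the same counting idea applied from the right (the helper below is a hypothetical illustration, not this module's implementation):

```python
from collections import Counter

def shortest_unique_suffixes_sketch(items):
    # Count how many items share each character-level suffix
    counts = Counter()
    for item in items:
        for i in range(1, len(item) + 1):
            counts[item[-i:]] += 1
    # Each item's answer is its shortest suffix seen exactly once
    result = []
    for item in items:
        for i in range(1, len(item) + 1):
            if counts[item[-i:]] == 1:
                result.append(item[-i:])
                break
        else:
            result.append(item)  # duplicate item: no unique suffix exists
    return result

print(shortest_unique_suffixes_sketch(["zebra", "dog", "duck", "dove"]))
# -> ['a', 'g', 'k', 'e']
```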

Example

>>> # xdoctest: +REQUIRES(module:pygtrie)
>>> items = ["aa/bb/cc", "aa/bb/bc", "aa/bb/dc", "aa/cc/cc"]
>>> shortest_unique_suffixes(items)