geowatch.utils.util_stringalgo module¶
- geowatch.utils.util_stringalgo.shortest_unique_prefixes(items, sep=None, allow_simple=True, min_length=0, allow_end=False)[source]¶
The shortest unique prefix algorithm.
- Parameters:
items (List[str]) – returned prefixes will be unique wrt this set
sep (str) – if specified, all characters between separators are treated as a single symbol, which makes the algorithm much faster
allow_simple (bool) – if True, tries to construct a simple feasible solution before resorting to the optimal trie algorithm
min_length (int) – minimum length each prefix can be
allow_end (bool) – if True, allows string terminators to be considered in the prefix
- Returns:
- a prefix for each item that uniquely identifies it
wrt the original items.
- Return type:
List[str]
References
http://www.geeksforgeeks.org/find-all-shortest-unique-prefixes-to-represent-each-word-in-a-given-list/
https://github.com/Briaares/InterviewBit/blob/master/Level6/Shortest%20Unique%20Prefix.cpp
- Requires:
pip install pygtrie
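The references above describe the counting idea behind the algorithm: a prefix uniquely identifies an item exactly when only one item in the set shares it. A minimal pure-Python sketch of that idea follows; the helper name `unique_prefixes_sketch` is hypothetical and uses a `Counter` instead of the `pygtrie` trie that the real function relies on, so it illustrates the semantics rather than the library's implementation:

```python
from collections import Counter

def unique_prefixes_sketch(items, min_length=0):
    # Count how many items share each possible prefix.
    counts = Counter()
    for item in items:
        for i in range(1, len(item) + 1):
            counts[item[:i]] += 1
    # For each item, take the shortest prefix shared by no other item.
    prefixes = []
    for item in items:
        for i in range(max(min_length, 1), len(item) + 1):
            if counts[item[:i]] == 1:
                prefixes.append(item[:i])
                break
        else:
            # No proper unique prefix exists; fall back to the full string.
            prefixes.append(item)
    return prefixes

print(unique_prefixes_sketch(["zebra", "dog", "duck", "dove"]))
# ['z', 'dog', 'du', 'dov']
```

This matches the doctest below; the sketch is O(total prefix count) in memory, which is why the real function uses a trie (and the `sep` tokenization) for large inputs.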
Example
>>> # xdoctest: +REQUIRES(module:pygtrie)
>>> items = ["zebra", "dog", "duck", "dove"]
>>> shortest_unique_prefixes(items)
['z', 'dog', 'du', 'dov']
Example
>>> # xdoctest: +REQUIRES(module:pygtrie)
>>> from geowatch.utils.util_stringalgo import *  # NOQA
>>> smeti = ["params.metrics.foo.mean", "params.metrics.foo.std", "params.metrics.foo.count"]
>>> items = [p[::-1] for p in smeti]
>>> euqine = shortest_unique_prefixes(items, sep='.', min_length=2)
>>> unique = [p[::-1] for p in euqine]
>>> print(f'unique={unique}')
unique=['foo.mean', 'foo.std', 'foo.count']
- Timing:
>>> # DISABLE_DOCTEST
>>> # make numbers larger to stress test
>>> # L = max length of a string, N = number of strings,
>>> # C = smallest guaranteed common length
>>> # (the setting N=10000, L=100, C=20 is feasible, so we are good)
>>> import timerit
>>> import random
>>> def make_data(N, L, C):
>>>     rng = random.Random(0)
>>>     return [''.join(['a' if i < C else chr(rng.randint(97, 122))
>>>                      for i in range(L)]) for _ in range(N)]
>>> items = make_data(N=1000, L=10, C=0)
>>> timerit.Timerit(3).call(shortest_unique_prefixes, items).print()
Timed for: 3 loops, best of 3 time per loop: best=24.54 ms, mean=24.54 ± 0.0 ms
>>> items = make_data(N=1000, L=100, C=0)
>>> timerit.Timerit(3).call(shortest_unique_prefixes, items).print()
Timed for: 3 loops, best of 3 time per loop: best=155.4 ms, mean=155.4 ± 0.0 ms
>>> items = make_data(N=1000, L=100, C=70)
>>> timerit.Timerit(3).call(shortest_unique_prefixes, items).print()
Timed for: 3 loops, best of 3 time per loop: best=232.8 ms, mean=232.8 ± 0.0 ms
>>> items = make_data(N=10000, L=250, C=20)
>>> timerit.Timerit(3).call(shortest_unique_prefixes, items).print()
Timed for: 3 loops, best of 3 time per loop: best=4.063 s, mean=4.063 ± 0.0 s
- geowatch.utils.util_stringalgo.shortest_unique_suffixes(items, sep=None, min_length=0)[source]¶
The shortest unique suffix algorithm.
Example
>>> # xdoctest: +REQUIRES(--pygtrie)
>>> items = ["zebra", "dog", "duck", "dove"]
>>> shortest_unique_suffixes(items)
['a', 'g', 'k', 'e']
Example
>>> # xdoctest: +REQUIRES(--pygtrie)
>>> items = ["aa/bb/cc", "aa/bb/bc", "aa/bb/dc", "aa/cc/cc"]
>>> shortest_unique_suffixes(items)