geowatch.utils.util_pandas module

class geowatch.utils.util_pandas.DataFrame(data=None, index: Axes | None = None, columns: Axes | None = None, dtype: Dtype | None = None, copy: bool | None = None)[source]

Bases: DataFrame

Extension of pandas dataframes with quality-of-life improvements.

References:

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> from geowatch.utils import util_pandas
>>> df = util_pandas.DataFrame.random()

classmethod random(rows=10, columns='abcde', rng=None)[source]

Create a random data frame for testing.

rows = 10
columns = 'abcde'
rng = None
cls = util_pandas.DataFrame

safe_drop(labels, axis=0)[source]

Like self.drop(), but does not error if the specified labels do not exist.

Parameters:
  • df (pd.DataFrame) – df

  • labels (List) – …

  • axis (int) – todo

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> self = DataFrame({k: np.random.rand(10) for k in 'abcde'})
>>> self.safe_drop(list('bdf'), axis=1)
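
The "does not error if the specified labels do not exist" behavior amounts to filtering the requested labels against the axis before dropping. The following is a minimal pure-Python sketch of that filtering step only; `existing_labels` is a hypothetical helper name, not part of geowatch. (pandas' own `DataFrame.drop` also accepts `errors='ignore'`; whether `safe_drop` relies on it is not stated here.)

```python
def existing_labels(labels, index):
    """Hypothetical helper: keep only the labels that are present in ``index``."""
    present = set(index)
    # Preserve the caller's order; silently skip anything not on the axis.
    return [label for label in labels if label in present]


columns = list('abcde')
to_drop = existing_labels(list('bdf'), columns)   # 'f' is not a column
remaining = [c for c in columns if c not in to_drop]
print(to_drop, remaining)
```
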
reorder(head=None, tail=None, axis=0, missing='error', fill_value=nan, **kwargs)[source]

Change the order of the row or column index. Unspecified labels will keep their existing order after the specified labels.

Parameters:
  • head (List | None) – The order of the labels to put at the start of the re-indexed data frame. Unspecified labels keep their relative order and are placed after these specified “head” labels.

  • tail (List | None) – The order of the labels to put at the end of the re-indexed data frame. Unspecified labels keep their relative order and are placed before these “tail” labels.

  • axis (int) – The axis to reorder: 0 for rows, 1 for columns.

  • missing (str) – Policy to handle specified labels that do not exist in the specified axis. Can be either “error”, “drop”, or “fill”. If “drop”, then drop any specified labels that do not exist. If “error”, then raise an error if non-existent labels are given. If “fill”, then fill in values for labels that do not exist.

  • fill_value (Any) – fill value to use when missing is “fill”.

Returns:

Self - DataFrame with modified indexes

Example

>>> from geowatch.utils import util_pandas
>>> self = util_pandas.DataFrame.random(rows=5, columns=['a', 'b', 'c', 'd', 'e', 'f'])
>>> new = self.reorder(['b', 'c'], axis=1)
>>> assert list(new.columns) == ['b', 'c', 'a', 'd', 'e', 'f']
>>> # Set the order of the first and last of the columns
>>> new = self.reorder(head=['b', 'c'], tail=['e', 'd'], axis=1)
>>> assert list(new.columns) == ['b', 'c', 'a', 'f', 'e', 'd']
>>> # Test reordering the rows
>>> new = self.reorder([1, 0], axis=0)
>>> assert list(new.index) == [1, 0, 2, 3, 4]
>>> # Test reordering with a non-existent column
>>> new = self.reorder(['q'], axis=1, missing='drop')
>>> assert list(new.columns) == ['a', 'b', 'c', 'd', 'e', 'f']
>>> new = self.reorder(['q'], axis=1, missing='fill')
>>> assert list(new.columns) == ['q', 'a', 'b', 'c', 'd', 'e', 'f']
>>> import pytest
>>> with pytest.raises(ValueError):
>>>     self.reorder(['q'], axis=1, missing='error')
>>> # Should error if column is given in both head and tail
>>> with pytest.raises(ValueError):
>>>     self.reorder(['c'], ['c'], axis=1, missing='error')
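
The ordering rules documented for `reorder` (head labels first, tail labels last, everything else in its existing relative order, plus the `missing` policy) can be sketched on a plain list of labels. This is an illustrative reimplementation of the label logic only, under the assumption that it matches the documented behavior; `reorder_labels` is a hypothetical name and no data is moved or filled here.

```python
def reorder_labels(existing, head=None, tail=None, missing="error"):
    """Hypothetical helper: compute the new label order for one axis."""
    head = list(head or [])
    tail = list(tail or [])
    overlap = set(head) & set(tail)
    if overlap:
        raise ValueError(f"labels given in both head and tail: {overlap!r}")
    known = set(existing)
    unknown = [x for x in head + tail if x not in known]
    if unknown:
        if missing == "error":
            raise ValueError(f"labels do not exist: {unknown!r}")
        elif missing == "drop":
            # Forget about requested labels that are not on the axis.
            head = [x for x in head if x in known]
            tail = [x for x in tail if x in known]
        elif missing != "fill":
            raise KeyError(missing)
        # For "fill" the unknown labels stay in the order; a data frame
        # would then need fill values for them.
    specified = set(head) | set(tail)
    middle = [x for x in existing if x not in specified]
    return head + middle + tail


print(reorder_labels(list("abcdef"), head=["b", "c"], tail=["e", "d"]))
```

The asserted orders in the doctest above are the same ones this sketch produces for the corresponding arguments.
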
groupby(by=None, **kwargs)[source]

Fixed groupby behavior so length-one arguments are handled correctly

Parameters:
  • df (DataFrame)

  • ** kwargs – groupby kwargs

Example

>>> from geowatch.utils import util_pandas
>>> df = util_pandas.DataFrame({
>>>     'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
>>>     'Color': ['Blue', 'Blue', 'Blue', 'Yellow'],
>>>     'Max Speed': [380., 370., 24., 26.]
>>>     })
>>> new1 = dict(list(df.groupby(['Animal', 'Color'])))
>>> new2 = dict(list(df.groupby(['Animal'])))
>>> new3 = dict(list(df.groupby('Animal')))
>>> assert sorted(new1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(new3.keys())[0] == 'Falcon'
>>> # This is the case that is fixed.
>>> assert sorted(new2.keys())[0] == ('Falcon',)
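
The "length-one" case is about the type of the group keys: a list of column names should always yield tuple keys, even when the list has a single entry, while a bare column name yields scalar keys. A pure-Python sketch of that convention over a list of row dictionaries follows; `group_rows` is a hypothetical helper used only for illustration and does not involve pandas.

```python
from collections import defaultdict


def group_rows(rows, by):
    """Hypothetical helper: group row dicts, mirroring the key convention."""
    by_is_list = isinstance(by, (list, tuple))
    groups = defaultdict(list)
    for row in rows:
        if by_is_list:
            # A list of names always produces a tuple key, even of length one.
            key = tuple(row[name] for name in by)
        else:
            # A single bare name produces a scalar key.
            key = row[by]
        groups[key].append(row)
    return dict(groups)


rows = [
    {'Animal': 'Falcon', 'Color': 'Blue', 'Max Speed': 380.},
    {'Animal': 'Falcon', 'Color': 'Blue', 'Max Speed': 370.},
    {'Animal': 'Parrot', 'Color': 'Blue', 'Max Speed': 24.},
    {'Animal': 'Parrot', 'Color': 'Yellow', 'Max Speed': 26.},
]
print(sorted(group_rows(rows, ['Animal']).keys()))
```
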
match_columns(pat, hint='glob')[source]

Find matching columns in O(N)

search_columns(pat, hint='glob')[source]

Find matching columns in O(N)
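
As a rough illustration of an O(N) glob lookup over column names, the sketch below scans the names once with the standard-library `fnmatch` module. This is an assumption about what glob matching looks like in general, not the backend these methods use, and `match_glob_columns` is a hypothetical name.

```python
import fnmatch


def match_glob_columns(columns, pat):
    """Hypothetical helper: one linear pass, keeping names that match the glob."""
    return [c for c in columns if fnmatch.fnmatchcase(c, pat)]


columns = ['node1.id', 'node2.id', 'node1.metrics.ap', 'node2.metrics.ap']
print(match_glob_columns(columns, '*metri*'))
```
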

varied_values(**kwargs)[source]
SeeAlso:

geowatch.utils.result_analysis.varied_values()

varied_value_counts(**kwargs)[source]
SeeAlso:

geowatch.utils.result_analysis.varied_value_counts()

geowatch.utils.util_pandas.pandas_reorder_columns(df, columns)[source]
geowatch.utils.util_pandas.pandas_argmaxima(data, columns, k=1)[source]

Finds the top K indexes for given columns.

Parameters:
  • data – pandas data frame

  • columns – columns to maximize. If multiple are given, then secondary columns are used as tiebreakers.

  • k – number of top entries

Returns:

indexes into subset of data that are in the top k for any of the requested columns.

Return type:

List

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({k: np.random.rand(10) for k in 'abcde'})
>>> columns = ['b', 'd', 'e']
>>> k = 1
>>> top_indexes = pandas_argmaxima(data=data, columns=columns, k=k)
>>> assert len(top_indexes) == k
>>> print(data.loc[top_indexes])
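
One reading of "secondary columns are used as tiebreakers" is a lexicographic sort on the requested columns, taking the first k positions. The sketch below implements that reading over a plain dict of lists; it is an interpretation of the description rather than the function's actual code, and `argmaxima_sketch` is a hypothetical name that returns positional indexes.

```python
def argmaxima_sketch(data, columns, k=1):
    """Hypothetical helper: positions of the top-k rows, ties broken by later columns."""
    num_rows = len(data[columns[0]])

    def sort_key(i):
        # Compare by the first column, then the second, and so on.
        return tuple(data[c][i] for c in columns)

    ranked = sorted(range(num_rows), key=sort_key, reverse=True)
    return ranked[:k]


data = {'b': [1, 3, 3, 0], 'd': [9, 1, 2, 5]}
print(argmaxima_sketch(data, ['b', 'd'], k=2))
```

Rows 1 and 2 tie on `b`, so the larger `d` value decides which of them ranks first.
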
geowatch.utils.util_pandas.pandas_suffix_columns(data, suffixes)[source]

Return columns that end with this suffix

geowatch.utils.util_pandas.pandas_nan_eq(a, b)[source]
geowatch.utils.util_pandas.pandas_shorten_columns(summary_table, return_mapping=False, min_length=0)[source]

Shorten column names

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame([
>>>     {'param_hashid': 'badbeaf', 'metrics.eval.f1': 0.9, 'metrics.eval.mcc': 0.8, 'metrics.eval.acc': 0.3},
>>>     {'param_hashid': 'decaf', 'metrics.eval.f1': 0.6, 'metrics.eval.mcc': 0.2, 'metrics.eval.acc': 0.4},
>>>     {'param_hashid': 'feedcode', 'metrics.eval.f1': 0.5, 'metrics.eval.mcc': 0.3, 'metrics.eval.acc': 0.1},
>>> ])
>>> print(df.to_string(index=0))
param_hashid  metrics.eval.f1  metrics.eval.mcc  metrics.eval.acc
     badbeaf              0.9               0.8               0.3
       decaf              0.6               0.2               0.4
    feedcode              0.5               0.3               0.1
>>> df2 = pandas_shorten_columns(df)
>>> print(df2.to_string(index=0))
param_hashid  f1  mcc  acc
     badbeaf 0.9  0.8  0.3
       decaf 0.6  0.2  0.4
    feedcode 0.5  0.3  0.1

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame([
>>>     {'param_hashid': 'badbeaf', 'metrics.eval.f1.mean': 0.9, 'metrics.eval.f1.std': 0.8},
>>>     {'param_hashid': 'decaf', 'metrics.eval.f1.mean': 0.6, 'metrics.eval.f1.std': 0.2},
>>>     {'param_hashid': 'feedcode', 'metrics.eval.f1.mean': 0.5, 'metrics.eval.f1.std': 0.3},
>>> ])
>>> df2 = pandas_shorten_columns(df, min_length=2)
>>> print(df2.to_string(index=0))
param_hashid  f1.mean  f1.std
     badbeaf      0.9     0.8
       decaf      0.6     0.2
    feedcode      0.5     0.3
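
The two examples above are consistent with shortening each dotted name to its shortest suffix that no other column shares, while keeping at least `min_length` components. The following is a sketch of that idea under this assumption; `shorten_dotted` is a hypothetical helper and is not claimed to be how `pandas_shorten_columns` is implemented.

```python
def shorten_dotted(names, min_length=0, sep="."):
    """Hypothetical helper: map each name to a short, still-unique dotted suffix."""
    parts = {name: name.split(sep) for name in names}
    mapping = {}
    for name, p in parts.items():
        # Start from min_length components (at least 1, at most all of them).
        start = max(1, min(min_length, len(p)))
        for n in range(start, len(p) + 1):
            candidate = p[-n:]
            clash = any(
                q[-n:] == candidate
                for other, q in parts.items()
                if other != name
            )
            if not clash:
                break
        # If every suffix clashed, n ends at len(p) and the full name is kept.
        mapping[name] = sep.join(p[-n:])
    return mapping


names = ['param_hashid', 'metrics.eval.f1.mean', 'metrics.eval.f1.std']
print(shorten_dotted(names, min_length=2))
```
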
geowatch.utils.util_pandas.pandas_condense_paths(colvals)[source]

Condense a column of paths to keep only the shortest distinguishing suffixes

Parameters:

colvals (pd.Series) – a column containing paths to condense

Returns:

the condensed series and a mapping from old to new

Return type:

Tuple

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> rows = [
>>>     {'path1': '/path/to/a/file1'},
>>>     {'path1': '/path/to/a/file2'},
>>> ]
>>> colvals = pd.DataFrame(rows)['path1']
>>> pandas_condense_paths(colvals)
geowatch.utils.util_pandas.pandas_truncate_items(data, paths=False, max_length=16)[source]

from geowatch.utils.util_pandas import pandas_truncate_items

Parameters:

data (pd.DataFrame) – data frame to truncate

Returns:

Tuple[pd.DataFrame, Dict[str, str]]

class geowatch.utils.util_pandas.DotDictDataFrame(*args, **kw)[source]

Bases: DataFrame

A proof-of-concept wrapper around pandas that lets us walk down the nested structure a little easier.

The API is a bit weird, and the caches are not invalidated if any column changes, but it does a reasonable job otherwise.

Is there another library out there that does this?

SeeAlso:

DotDict

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> rows = [
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>> ]
>>> self = DotDictDataFrame(rows)
>>> # Test prefix lookup
>>> assert set(self['node1'].columns) == {'node1.id', 'node1.metrics.ap'}
>>> # Test suffix lookup
>>> assert set(self['id'].columns) == {'node1.id', 'node2.id'}
>>> # Test mid-node lookup
>>> assert set(self['metrics'].columns) == {'node1.metrics.ap', 'node2.metrics.ap'}
>>> # Test single lookup
>>> assert set(self[['node1.id']].columns) == {'node1.id'}
>>> # Test glob
>>> assert set(self.find_columns('*metri*')) == {'node1.metrics.ap', 'node2.metrics.ap'}
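
The prefix, suffix, and mid-node lookups asserted above all behave like a search for the key as a run of consecutive dotted components inside each column name. A pure-Python sketch of that matching rule is shown below; `lookup_dotted` is a hypothetical helper, and the real class may resolve keys differently (for example through its cached `nested_columns`).

```python
def lookup_dotted(columns, key, sep="."):
    """Hypothetical helper: columns whose dotted parts contain ``key`` contiguously."""
    want = key.split(sep)
    n = len(want)
    found = []
    for col in columns:
        parts = col.split(sep)
        # Slide a window of n components across the column's parts.
        if any(parts[i:i + n] == want for i in range(len(parts) - n + 1)):
            found.append(col)
    return found


columns = ['node1.id', 'node2.id', 'node1.metrics.ap', 'node2.metrics.ap']
print(lookup_dotted(columns, 'metrics'))
```
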
property nested_columns
find_column(col)[source]
query_column(col)[source]
lookup_suffix_columns(col)[source]
lookup_prefix_columns(col)[source]
find_columns(pat, hint='glob')[source]
match_columns(pat, hint='glob')[source]
search_columns(pat, hint='glob')[source]
subframe(key, drop_prefix=True)[source]

Given a prefix key, return the subset of columns that match it, with the prefix stripped.
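
A minimal sketch of the renaming this implies, assuming "." is the separator and that only columns strictly below the prefix are kept. `strip_prefix` is a hypothetical helper that returns an old-name to new-name mapping rather than a data frame.

```python
def strip_prefix(columns, key, sep="."):
    """Hypothetical helper: select columns under ``key`` and drop the prefix."""
    prefix = key + sep
    return {c: c[len(prefix):] for c in columns if c.startswith(prefix)}


columns = ['node1.id', 'node2.id', 'node1.metrics.ap', 'node2.metrics.ap']
print(strip_prefix(columns, 'node1'))
```
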

geowatch.utils.util_pandas.pandas_add_prefix(data, prefix)[source]
geowatch.utils.util_pandas.aggregate_columns(df, aggregator=None, fallback='const', nonconst_policy='error')[source]

Aggregates parameter columns based on per-column strategies / functions specified in aggregator.

Parameters:
  • hash_cols (None | List[str]) – columns whose values should be hashed together.

  • aggregator (Dict[str, str | callable]) – a dictionary mapping column names to a callable function that should be used to aggregate them. There are special string codes that we accept as well. Special functions are: hist, hash, min-max, const.

  • fallback (str | callable) – Aggregator function for any column without an explicit aggregator. Defaults to “const”, which passes one value from the columns through if they are constant. If they are not constant, the nonconst_policy is triggered.

  • nonconst_policy (str) – Behavior when the aggregator is “const”, but the input is non-constant. The policies are:

    • ‘error’ - error if unhandled non-uniform columns exist

    • ‘drop’ - remove unhandled non-uniform columns

Returns:

pd.Series

Todo

  • [ ] optimize this

CommandLine

xdoctest -m geowatch.utils.util_pandas aggregate_columns

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'nums1': np.random.rand(num_rows),
>>>     'nums2': np.random.rand(num_rows),
>>>     'nums3': (np.random.rand(num_rows) * 10).astype(int),
>>>     'nums4': (np.random.rand(num_rows) * 10).astype(int),
>>>     'cats1': np.random.randint(0, 3, num_rows),
>>>     'cats2': np.random.randint(0, 3, num_rows),
>>>     'cats3': np.random.randint(0, 3, num_rows),
>>>     'const1': ['a'] * num_rows,
>>>     'strs1': [np.random.choice(list('abc')) for _ in range(num_rows)],
>>> }
>>> df = pd.DataFrame(columns)
>>> aggregator = ub.udict({
>>>     'nums1': 'mean',
>>>     'nums2': 'max',
>>>     'nums3': 'min-max',
>>>     'nums4': 'stats',
>>>     'cats1': 'histogram',
>>>     'cats3': 'first',
>>>     'cats2': 'hash12',
>>>     'strs1': 'hash12',
>>> })
>>> #
>>> # Test that the const fallback works
>>> row = aggregate_columns(df, aggregator, fallback='const')
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))
>>> assert row['const1'] == 'a'
>>> row = aggregate_columns(df.iloc[0:1], aggregator, fallback='const')
>>> assert row['const1'] == 'a'
>>> #
>>> # Test that the drop fallback works
>>> row = aggregate_columns(df, aggregator, fallback='drop')
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))
>>> assert 'const1' not in row
>>> row = aggregate_columns(df.iloc[0:1], aggregator, fallback='drop')
>>> assert 'const1' not in row
>>> #
>>> # Test that non-constant policy triggers
>>> aggregator_ = aggregator - {'cats3'}
>>> import pytest
>>> with pytest.raises(NonConstantError):
>>>     row = aggregate_columns(df, aggregator_, nonconst_policy='error')
>>> row = aggregate_columns(df, aggregator_, nonconst_policy='drop')
>>> assert 'cats3' not in row
>>> row = aggregate_columns(df, aggregator_, nonconst_policy='hash')
>>> assert 'cats3' in row
>>> #
>>> # Test an empty dataframe returns an empty series
>>> row = aggregate_columns(df.iloc[0:0], aggregator)
>>> assert len(row) == 0
>>> #
>>> # Test single column cases work fine.
>>> for col in df.columns:
...     subdf = df[[col]]
...     subagg = aggregate_columns(subdf, aggregator, fallback='const')
...     assert len(subagg) == 1
>>> #
>>> # Test single column drop case works
>>> subagg = aggregate_columns(df[['cats3']], aggregator_, fallback='const', nonconst_policy='drop')
>>> assert len(subagg) == 0
>>> subagg = aggregate_columns(df[['cats3']], aggregator_, fallback='drop')
>>> assert len(subagg) == 0

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'dates': ['2101-01-01', '1970-01-01', '2000-01-01'],
>>>     'lists': [['a'], ['a', 'b'], []],
>>>     'nums':  [1, 2, 3],
>>> }
>>> df = pd.DataFrame(columns)
>>> aggregator = ub.udict({
>>>     'dates': 'min-max',
>>>     'lists': 'hash',
>>>     'nums':  'mean',
>>> })
>>> row = aggregate_columns(df, aggregator)
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'items': [['a'], ['bcd', 'ef'], [], ['3', '234', '2343']],
>>> }
>>> df = pd.DataFrame(columns)
>>> row = aggregate_columns(df, 'last', fallback='const')
>>> columns = {
>>>     'items': ['a', 'c', 'c', 'd'],
>>>     'items2': [['a'], ['bcd', 'ef'], [], ['3', '234', '2343']],
>>> }
>>> df = pd.DataFrame(columns)
>>> row = aggregate_columns(df, 'unique')
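
The interaction between the “const” fallback and `nonconst_policy` documented for `aggregate_columns` can be sketched without pandas. The version below handles only callable aggregators and the two documented policies ('error' and 'drop'); `aggregate_const` and `NonConstant` are hypothetical names standing in for the real function and its `NonConstantError`, and the special string codes are left out.

```python
class NonConstant(ValueError):
    """Hypothetical stand-in for an error raised on non-constant columns."""


def aggregate_const(columns, aggregator, nonconst_policy="error"):
    """Hypothetical helper: collapse a dict of equal-length lists into one row."""
    row = {}
    for name, values in columns.items():
        if not values:
            # Nothing to aggregate for an empty column.
            continue
        if name in aggregator:
            # An explicit per-column aggregator takes precedence.
            row[name] = aggregator[name](values)
            continue
        # "const" fallback: pass the value through only if it never changes.
        first = values[0]
        if all(v == first for v in values):
            row[name] = first
        elif nonconst_policy == "drop":
            continue
        elif nonconst_policy == "error":
            raise NonConstant(f"column {name!r} is not constant")
        else:
            raise KeyError(nonconst_policy)
    return row


columns = {'const1': ['a', 'a', 'a'], 'nums': [1, 2, 3], 'cats': [0, 1, 0]}
print(aggregate_const(columns, {'nums': max}, nonconst_policy='drop'))
```
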
class geowatch.utils.util_pandas.SpecialAggregators[source]

Bases: object

hash()[source]
hash12()[source]
unique()[source]
min_max()[source]
static normalize_special_key(k)[source]
special_lut = {'first': <function SpecialAggregators.<lambda>>, 'hash': <function SpecialAggregators.hash>, 'hash12': <function SpecialAggregators.hash12>, 'hist': <function dict_hist>, 'histogram': <function dict_hist>, 'last': <function SpecialAggregators.<lambda>>, 'min_max': <function SpecialAggregators.min_max>, 'stats': <function stats_dict>, 'unique': <function SpecialAggregators.unique>}
exception geowatch.utils.util_pandas.NonConstantError[source]

Bases: ValueError

geowatch.utils.util_pandas.nan_eq(a, b)[source]
class geowatch.utils.util_pandas.GroupbyFutureWrapper[source]

Bases: ObjectProxy

Wraps a groupby object to get the new behavior sooner.

geowatch.utils.util_pandas.pandas_fixed_groupby(df, by=None, **kwargs)[source]

Fixed groupby behavior so length-one arguments are handled correctly

Parameters:
  • df (DataFrame)

  • ** kwargs – groupby kwargs

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame({
>>>     'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
>>>     'Color': ['Blue', 'Blue', 'Blue', 'Yellow'],
>>>     'Max Speed': [380., 370., 24., 26.]
>>>     })
>>> # Old behavior
>>> old1 = dict(list(df.groupby(['Animal', 'Color'])))
>>> old2 = dict(list(df.groupby(['Animal'])))
>>> old3 = dict(list(df.groupby('Animal')))
>>> new1 = dict(list(pandas_fixed_groupby(df, ['Animal', 'Color'])))
>>> new2 = dict(list(pandas_fixed_groupby(df, ['Animal'])))
>>> new3 = dict(list(pandas_fixed_groupby(df, 'Animal')))
>>> assert sorted(new1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(old1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(new3.keys())[0] == 'Falcon'
>>> assert sorted(old3.keys())[0] == 'Falcon'
>>> # This is the case that is fixed.
>>> assert sorted(new2.keys())[0] == ('Falcon',)
>>> import numpy as np
>>> if np.lib.NumpyVersion(pd.__version__) < '2.0.0':
>>>     assert sorted(old2.keys())[0] == 'Falcon'