geowatch.utils.util_pandas module

class geowatch.utils.util_pandas.DataFrame(data=None, index: Axes | None = None, columns: Axes | None = None, dtype: Dtype | None = None, copy: bool | None = None)[source]

Bases: DataFrame

Extension of pandas dataframes with quality-of-life improvements.

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> from geowatch.utils import util_pandas
>>> df = util_pandas.DataFrame.random()

classmethod random(rows=10, columns='abcde', rng=None)[source]

Create a random data frame for testing.

classmethod coerce(data)[source]

Ensures that the input is an instance of our extended DataFrame.

Pandas is generally good about input coercion via its normal constructors; the purpose of this classmethod is to quickly ensure that a DataFrame has all of the extended methods defined by this class without incurring a copy. In this sense it is more similar to numpy.asarray().

Parameters:

data (DataFrame | ndarray | Iterable | dict) – generally another dataframe, otherwise normal inputs that would be given to the regular pandas dataframe constructor

Return type:

DataFrame

Example

>>> # xdoctest: +REQUIRES(--benchmark)
>>> # This example demonstrates the speed difference between
>>> # recasting as a DataFrame versus using coerce
>>> from geowatch.utils.util_pandas import DataFrame
>>> data = DataFrame.random(rows=10_000)
>>> import timerit
>>> ti = timerit.Timerit(100, bestof=10, verbose=2)
>>> for timer in ti.reset('constructor'):
>>>     with timer:
>>>         DataFrame(data)
>>> for timer in ti.reset('coerce'):
>>>     with timer:
>>>         DataFrame.coerce(data)
>>> # xdoctest: +IGNORE_WANT
Timed constructor for: 100 loops, best of 10
    time per loop: best=2.594 µs, mean=2.783 ± 0.1 µs
Timed coerce for: 100 loops, best of 10
    time per loop: best=246.000 ns, mean=283.000 ± 32.4 ns
safe_drop(labels, axis=0)[source]

Like self.drop(), but does not error if the specified labels do not exist.

Parameters:
  • labels (List) – the row or column labels to drop, if they exist.

  • axis (int) – the axis to drop from: 0 for rows, 1 for columns.

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> self = DataFrame({k: np.random.rand(10) for k in 'abcde'})
>>> self.safe_drop(list('bdf'), axis=1)
reorder(head=None, tail=None, axis=0, missing='error', fill_value=nan, **kwargs)[source]

Change the order of the row or column index. Unspecified labels will keep their existing order after the specified labels.

Parameters:
  • head (List | None) – The order of the labels to put at the start of the re-indexed data frame. Unspecified labels keep their relative order and are placed after these “head” labels.

  • tail (List | None) – The order of the labels to put at the end of the re-indexed data frame. Unspecified labels keep their relative order and are placed before these “tail” labels.

  • axis (int) – The axis to reorder: 0 for rows, 1 for columns.

  • missing (str) – Policy for handling specified labels that do not exist on the specified axis. Can be “error”, “drop”, or “fill”. If “drop”, any specified labels that do not exist are dropped. If “error”, an error is raised when non-existing labels are given. If “fill”, values are filled in for labels that do not exist.

  • fill_value (Any) – fill value to use when missing is “fill”.

Returns:

Self - DataFrame with modified indexes

Example

>>> from geowatch.utils import util_pandas
>>> self = util_pandas.DataFrame.random(rows=5, columns=['a', 'b', 'c', 'd', 'e', 'f'])
>>> new = self.reorder(['b', 'c'], axis=1)
>>> assert list(new.columns) == ['b', 'c', 'a', 'd', 'e', 'f']
>>> # Set the order of the first and last of the columns
>>> new = self.reorder(head=['b', 'c'], tail=['e', 'd'], axis=1)
>>> assert list(new.columns) == ['b', 'c', 'a', 'f', 'e', 'd']
>>> # Test reordering the rows
>>> new = self.reorder([1, 0], axis=0)
>>> assert list(new.index) == [1, 0, 2, 3, 4]
>>> # Test reordering with a non-existent column
>>> new = self.reorder(['q'], axis=1, missing='drop')
>>> assert list(new.columns) == ['a', 'b', 'c', 'd', 'e', 'f']
>>> new = self.reorder(['q'], axis=1, missing='fill')
>>> assert list(new.columns) == ['q', 'a', 'b', 'c', 'd', 'e', 'f']
>>> import pytest
>>> with pytest.raises(ValueError):
>>>     self.reorder(['q'], axis=1, missing='error')
>>> # Should error if column is given in both head and tail
>>> with pytest.raises(ValueError):
>>>     self.reorder(['c'], ['c'], axis=1, missing='error')
groupby(by=None, **kwargs)[source]

Fixed groupby behavior so length-one arguments are handled correctly

Parameters:
  • by – the grouping key or list of keys, as in pandas.DataFrame.groupby().

  • **kwargs – additional groupby keyword arguments.

Example

>>> from geowatch.utils import util_pandas
>>> df = util_pandas.DataFrame({
>>>     'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
>>>     'Color': ['Blue', 'Blue', 'Blue', 'Yellow'],
>>>     'Max Speed': [380., 370., 24., 26.]
>>>     })
>>> new1 = dict(list(df.groupby(['Animal', 'Color'])))
>>> new2 = dict(list(df.groupby(['Animal'])))
>>> new3 = dict(list(df.groupby('Animal')))
>>> assert sorted(new1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(new3.keys())[0] == 'Falcon'
>>> # This is the case that is fixed.
>>> assert sorted(new2.keys())[0] == ('Falcon',)
match_columns(pat, hint='glob')[source]

Find matching columns in O(N)

search_columns(pat, hint='glob')[source]

Find matching columns in O(N)
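
A minimal usage sketch (not from the source docs): this assumes both methods take a glob pattern by default and return the list of matching column names, with match_columns matching whole names and search_columns matching substrings. The column names below are illustrative.

Example

>>> # xdoctest: +SKIP
>>> from geowatch.utils import util_pandas
>>> self = util_pandas.DataFrame.random(columns=['id', 'metrics.f1', 'metrics.acc'])
>>> self.match_columns('metrics.*')   # e.g. ['metrics.f1', 'metrics.acc']
>>> self.search_columns('metrics')    # e.g. ['metrics.f1', 'metrics.acc']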

varied_values(**kwargs)[source]
Kwargs:

min_variations=0, max_variations=None, default=ub.NoParam, dropna=False, on_error='raise'

SeeAlso:

geowatch.utils.result_analysis.varied_values()

varied_value_counts(**kwargs)[source]
Kwargs:

min_variations=0, max_variations=None, default=ub.NoParam, dropna=False, on_error='raise'

SeeAlso:

geowatch.utils.result_analysis.varied_value_counts()
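
A hedged sketch of the expected behavior for both methods, based on the corresponding geowatch.utils.result_analysis functions: columns whose values actually vary (more than min_variations distinct values) are reported, either as their unique values or as value histograms. The exact return structure is an assumption here.

Example

>>> # xdoctest: +SKIP
>>> from geowatch.utils import util_pandas
>>> df = util_pandas.DataFrame({'const': [1, 1, 1], 'varied': ['a', 'a', 'b']})
>>> df.varied_values(min_variations=2)        # e.g. {'varied': {'a', 'b'}}
>>> df.varied_value_counts(min_variations=2)  # e.g. {'varied': {'a': 2, 'b': 1}}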

shorten_columns(return_mapping=False, min_length=0)[source]

Shorten column names by separating unique suffixes based on the “.” separator.

Parameters:
  • return_mapping (bool) – if True, also returns the mapping from old column names to new column names.

  • min_length (int) – minimum size of the new column names in terms of parts.

Returns:

Either the new data frame with shortened column names or that data frame and the mapping from old column names to new column names.

Return type:

DataFrame | Tuple[DataFrame, Dict[str, str]]

Example

>>> from geowatch.utils.util_pandas import DataFrame
>>> # If all suffixes are unique, then they are used.
>>> self = DataFrame.random(columns=['id', 'params.metrics.f1', 'params.metrics.acc', 'params.fit.model.lr', 'params.fit.data.seed'])
>>> new = self.shorten_columns()
>>> assert list(new.columns) == ['id', 'f1', 'acc', 'lr', 'seed']
>>> # Conflicting suffixes impose limitations on what can be shortened
>>> self = DataFrame.random(columns=['id', 'params.metrics.magic', 'params.metrics.acc', 'params.fit.model.lr', 'params.fit.data.magic'])
>>> new = self.shorten_columns()
>>> assert list(new.columns) == ['id', 'metrics.magic', 'metrics.acc', 'model.lr', 'data.magic']
argextrema(columns, objective='maximize', k=1)[source]

Finds the top K indexes (locs) for given columns.

Parameters:
  • columns (str | List[str]) – columns to find extrema of. If multiple are given, then secondary columns are used as tiebreakers.

  • objective (str | List[str]) – Either maximize or minimize (max and min are also accepted). If given as a list, it specifies the criteria for each column, which allows for a mix of maximization and minimization.

  • k – number of top entries

Returns:

indexes into subset of data that are in the top k for any of the requested columns.

Return type:

List

Example

>>> from geowatch.utils.util_pandas import DataFrame
>>> # If all suffixes are unique, then they are used.
>>> self = DataFrame.random(columns=['id', 'f1', 'loss'], rows=10)
>>> self.loc[3, 'f1'] = 1.0
>>> self.loc[4, 'f1'] = 1.0
>>> self.loc[5, 'f1'] = 1.0
>>> self.loc[3, 'loss'] = 0.2
>>> self.loc[4, 'loss'] = 0.3
>>> self.loc[5, 'loss'] = 0.1
>>> columns = ['f1', 'loss']
>>> k = 4
>>> top_indexes = self.argextrema(columns=columns, k=k, objective=['max', 'min'])
>>> assert len(top_indexes) == k
>>> print(self.loc[top_indexes])
geowatch.utils.util_pandas.pandas_reorder_columns(df, columns)[source]

DEPRECATED: Use DataFrame.reorder() instead

geowatch.utils.util_pandas.pandas_argmaxima(data, columns, k=1)[source]

Finds the top K indexes for given columns.

Parameters:
  • data – pandas data frame

  • columns – columns to maximize. If multiple are given, then secondary columns are used as tiebreakers.

  • k – number of top entries

Returns:

indexes into subset of data that are in the top k for any of the requested columns.

Return type:

List

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({k: np.random.rand(10) for k in 'abcde'})
>>> columns = ['b', 'd', 'e']
>>> k = 1
>>> top_indexes = pandas_argmaxima(data=data, columns=columns, k=k)
>>> assert len(top_indexes) == k
>>> print(data.loc[top_indexes])
geowatch.utils.util_pandas.pandas_suffix_columns(data, suffixes)[source]

Return columns that end with one of the given suffixes.
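
A minimal sketch, assuming the function returns the list of column names ending in one of the given suffixes:

Example

>>> # xdoctest: +SKIP
>>> import pandas as pd
>>> from geowatch.utils.util_pandas import pandas_suffix_columns
>>> data = pd.DataFrame(columns=['a.f1', 'b.f1', 'a.acc'])
>>> pandas_suffix_columns(data, ['.f1'])  # e.g. ['a.f1', 'b.f1']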

geowatch.utils.util_pandas.pandas_nan_eq(a, b)[source]
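
A hedged sketch of the presumed semantics (an assumption, not stated in the source): equality that treats NaN as equal to NaN, unlike the default float comparison.

Example

>>> # xdoctest: +SKIP
>>> import numpy as np
>>> from geowatch.utils.util_pandas import pandas_nan_eq
>>> np.nan == np.nan               # regular float comparison is False
>>> pandas_nan_eq(np.nan, np.nan)  # presumably True
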
geowatch.utils.util_pandas.pandas_shorten_columns(summary_table, return_mapping=False, min_length=0)[source]

Shorten column names

DEPRECATED: Use DataFrame.shorten_columns() instead.

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame([
>>>     {'param_hashid': 'badbeaf', 'metrics.eval.f1': 0.9, 'metrics.eval.mcc': 0.8, 'metrics.eval.acc': 0.3},
>>>     {'param_hashid': 'decaf', 'metrics.eval.f1': 0.6, 'metrics.eval.mcc': 0.2, 'metrics.eval.acc': 0.4},
>>>     {'param_hashid': 'feedcode', 'metrics.eval.f1': 0.5, 'metrics.eval.mcc': 0.3, 'metrics.eval.acc': 0.1},
>>> ])
>>> print(df.to_string(index=0))
>>> df2 = pandas_shorten_columns(df)
param_hashid  metrics.eval.f1  metrics.eval.mcc  metrics.eval.acc
     badbeaf              0.9               0.8               0.3
       decaf              0.6               0.2               0.4
    feedcode              0.5               0.3               0.1
>>> print(df2.to_string(index=0))
param_hashid  f1  mcc  acc
     badbeaf 0.9  0.8  0.3
       decaf 0.6  0.2  0.4
    feedcode 0.5  0.3  0.1

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame([
>>>     {'param_hashid': 'badbeaf', 'metrics.eval.f1.mean': 0.9, 'metrics.eval.f1.std': 0.8},
>>>     {'param_hashid': 'decaf', 'metrics.eval.f1.mean': 0.6, 'metrics.eval.f1.std': 0.2},
>>>     {'param_hashid': 'feedcode', 'metrics.eval.f1.mean': 0.5, 'metrics.eval.f1.std': 0.3},
>>> ])
>>> df2 = pandas_shorten_columns(df, min_length=2)
>>> print(df2.to_string(index=0))
param_hashid  f1.mean  f1.std
     badbeaf      0.9     0.8
       decaf      0.6     0.2
    feedcode      0.5     0.3
geowatch.utils.util_pandas.pandas_condense_paths(colvals)[source]

Condense a column of paths to keep only the shortest distinguishing suffixes

Parameters:

colvals (pd.Series) – a column containing paths to condense

Returns:

the condensed series and a mapping from old to new

Return type:

Tuple

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> rows = [
>>>     {'path1': '/path/to/a/file1'},
>>>     {'path1': '/path/to/a/file2'},
>>> ]
>>> colvals = pd.DataFrame(rows)['path1']
>>> pandas_condense_paths(colvals)
geowatch.utils.util_pandas.pandas_truncate_items(data, paths=False, max_length=16)[source]

Truncate overly long string items in a data frame, returning the truncated data frame and the mapping from old to new values.

Parameters:

data (pd.DataFrame) – data frame to truncate

Returns:

Tuple[pd.DataFrame, Dict[str, str]]
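
A hedged usage sketch; which items are truncated, and how, is an assumption here based on the signature and return type:

Example

>>> # xdoctest: +SKIP
>>> import pandas as pd
>>> from geowatch.utils.util_pandas import pandas_truncate_items
>>> data = pd.DataFrame({'name': ['a_very_long_identifier_string', 'short']})
>>> new_data, mapping = pandas_truncate_items(data, max_length=16)
>>> # mapping presumably records old-name -> truncated-name pairs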

class geowatch.utils.util_pandas.DotDictDataFrame(*args, **kw)[source]

Bases: DataFrame

A proof-of-concept wrapper around pandas that lets us walk down the nested structure a little easier.

The API is a bit weird, and the caches are not invalidated if any column changes, but it does a reasonable job otherwise.

Is there another library out there that does this?

SeeAlso:

DotDict

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> rows = [
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>> ]
>>> self = DotDictDataFrame(rows)
>>> # Test prefix lookup
>>> assert set(self['node1'].columns) == {'node1.id', 'node1.metrics.ap'}
>>> # Test suffix lookup
>>> assert set(self['id'].columns) == {'node1.id', 'node2.id'}
>>> # Test mid-node lookup
>>> assert set(self['metrics'].columns) == {'node1.metrics.ap', 'node2.metrics.ap'}
>>> # Test single lookup
>>> assert set(self[['node1.id']].columns) == {'node1.id'}
>>> # Test glob
>>> assert set(self.find_columns('*metri*')) == {'node1.metrics.ap', 'node2.metrics.ap'}
property nested_columns
find_column(col)[source]
query_column(col)[source]
lookup_suffix_columns(col)[source]
lookup_prefix_columns(col)[source]
find_columns(pat, hint='glob')[source]
match_columns(pat, hint='glob')[source]
search_columns(pat, hint='glob')[source]
subframe(key, drop_prefix=True)[source]

Given a prefix key, return the subset of columns that match it with the prefix stripped.
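
A minimal sketch based on the description above; the column names are illustrative:

Example

>>> # xdoctest: +SKIP
>>> from geowatch.utils.util_pandas import DotDictDataFrame
>>> self = DotDictDataFrame({'node1.id': [1], 'node1.metrics.ap': [0.5], 'node2.id': [2]})
>>> sub = self.subframe('node1')
>>> # expected: the 'node1.' prefix is stripped from matching columns
>>> assert set(sub.columns) == {'id', 'metrics.ap'}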

geowatch.utils.util_pandas.pandas_add_prefix(data, prefix)[source]
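
A hedged sketch, assuming this prepends prefix to every column name:

Example

>>> # xdoctest: +SKIP
>>> import pandas as pd
>>> from geowatch.utils.util_pandas import pandas_add_prefix
>>> data = pd.DataFrame({'f1': [0.9], 'acc': [0.3]})
>>> pandas_add_prefix(data, 'metrics.')  # columns become 'metrics.f1', 'metrics.acc'
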
geowatch.utils.util_pandas.aggregate_columns(df, aggregator=None, fallback='const', nonconst_policy='error')[source]

Aggregates parameter columns based on per-column strategies / functions specified in aggregator.

Parameters:
  • aggregator (Dict[str, str | callable]) – a dictionary mapping column names to a callable function that should be used to aggregate them. There are special string codes that we accept as well, including “first”, “last”, “hash”, “hash12”, “hist” / “histogram”, “min-max”, “stats”, “unique”, and “const”.

  • fallback (str | callable) – Aggregator function for any column without an explicit aggregator. Defaults to “const”, which passes one value from the columns through if they are constant. If they are not constant, the nonconst-policy is triggered.

  • nonconst_policy (str) – Behavior when the aggregator is “const”, but the input is non-constant. The policies are:

    • ‘error’ - error if unhandled non-uniform columns exist

    • ‘drop’ - remove unhandled non-uniform columns

    • ‘hash’ - hash unhandled non-uniform columns

Returns:

pd.Series

Todo

  • [ ] optimize this

CommandLine

xdoctest -m geowatch.utils.util_pandas aggregate_columns

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'nums1': np.random.rand(num_rows),
>>>     'nums2': np.random.rand(num_rows),
>>>     'nums3': (np.random.rand(num_rows) * 10).astype(int),
>>>     'nums4': (np.random.rand(num_rows) * 10).astype(int),
>>>     'cats1': np.random.randint(0, 3, num_rows),
>>>     'cats2': np.random.randint(0, 3, num_rows),
>>>     'cats3': np.random.randint(0, 3, num_rows),
>>>     'const1': ['a'] * num_rows,
>>>     'strs1': [np.random.choice(list('abc')) for _ in range(num_rows)],
>>> }
>>> df = pd.DataFrame(columns)
>>> aggregator = ub.udict({
>>>     'nums1': 'mean',
>>>     'nums2': 'max',
>>>     'nums3': 'min-max',
>>>     'nums4': 'stats',
>>>     'cats1': 'histogram',
>>>     'cats3': 'first',
>>>     'cats2': 'hash12',
>>>     'strs1': 'hash12',
>>> })
>>> #
>>> # Test that the const fallback works
>>> row = aggregate_columns(df, aggregator, fallback='const')
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))
>>> assert row['const1'] == 'a'
>>> row = aggregate_columns(df.iloc[0:1], aggregator, fallback='const')
>>> assert row['const1'] == 'a'
>>> #
>>> # Test that the drop fallback works
>>> row = aggregate_columns(df, aggregator, fallback='drop')
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))
>>> assert 'const1' not in row
>>> row = aggregate_columns(df.iloc[0:1], aggregator, fallback='drop')
>>> assert 'const1' not in row
>>> #
>>> # Test that non-constant policy triggers
>>> aggregator_ = aggregator - {'cats3'}
>>> import pytest
>>> with pytest.raises(NonConstantError):
>>>     row = aggregate_columns(df, aggregator_, nonconst_policy='error')
>>> row = aggregate_columns(df, aggregator_, nonconst_policy='drop')
>>> assert 'cats3' not in row
>>> row = aggregate_columns(df, aggregator_, nonconst_policy='hash')
>>> assert 'cats3' in row
>>> #
>>> # Test an empty dataframe returns an empty series
>>> row = aggregate_columns(df.iloc[0:0], aggregator)
>>> assert len(row) == 0
>>> #
>>> # Test single column cases work fine.
>>> for col in df.columns:
...     subdf = df[[col]]
...     subagg = aggregate_columns(subdf, aggregator, fallback='const')
...     assert len(subagg) == 1
>>> #
>>> # Test single column drop case works
>>> subagg = aggregate_columns(df[['cats3']], aggregator_, fallback='const', nonconst_policy='drop')
>>> assert len(subagg) == 0
>>> subagg = aggregate_columns(df[['cats3']], aggregator_, fallback='drop')
>>> assert len(subagg) == 0

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'dates': ['2101-01-01', '1970-01-01', '2000-01-01'],
>>>     'lists': [['a'], ['a', 'b'], []],
>>>     'nums':  [1, 2, 3],
>>> }
>>> df = pd.DataFrame(columns)
>>> aggregator = ub.udict({
>>>     'dates': 'min-max',
>>>     'lists': 'hash',
>>>     'nums':  'mean',
>>> })
>>> row = aggregate_columns(df, aggregator)
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'items': [['a'], ['bcd', 'ef'], [], ['3', '234', '2343']],
>>> }
>>> df = pd.DataFrame(columns)
>>> row = aggregate_columns(df, 'last', fallback='const')
>>> columns = {
>>>     'items': ['a', 'c', 'c', 'd'],
>>>     'items2': [['a'], ['bcd', 'ef'], [], ['3', '234', '2343']],
>>> }
>>> df = pd.DataFrame(columns)
>>> row = aggregate_columns(df, 'unique')
class geowatch.utils.util_pandas.SpecialAggregators[source]

Bases: object

hash()[source]
hash12()[source]
unique()[source]
min_max()[source]
static normalize_special_key(k)[source]
special_lut = {'first': <function SpecialAggregators.<lambda>>, 'hash': <function SpecialAggregators.hash>, 'hash12': <function SpecialAggregators.hash12>, 'hist': <function dict_hist>, 'histogram': <function dict_hist>, 'last': <function SpecialAggregators.<lambda>>, 'min_max': <function SpecialAggregators.min_max>, 'stats': <function stats_dict>, 'unique': <function SpecialAggregators.unique>}
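
The special_lut table above shows which string codes map to which functions; for instance “hist” and “histogram” are aliases for the same function. A small sketch (the normalize_special_key behavior of mapping “min-max” onto the “min_max” entry is an assumption):

>>> from geowatch.utils.util_pandas import SpecialAggregators
>>> assert SpecialAggregators.special_lut['hist'] is SpecialAggregators.special_lut['histogram']
>>> # xdoctest: +SKIP
>>> SpecialAggregators.normalize_special_key('min-max')  # presumably 'min_max'
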
exception geowatch.utils.util_pandas.NonConstantError[source]

Bases: ValueError

geowatch.utils.util_pandas.nan_eq(a, b)[source]
class geowatch.utils.util_pandas.GroupbyFutureWrapper[source]

Bases: ObjectProxy

Wraps a groupby object to get the new behavior sooner.

geowatch.utils.util_pandas.pandas_fixed_groupby(df, by=None, **kwargs)[source]

Fixed groupby behavior so length-one arguments are handled correctly

Parameters:
  • df (DataFrame) – the data frame to group.

  • by – the grouping key or list of keys, as in pandas.DataFrame.groupby().

  • **kwargs – additional groupby keyword arguments.

Example

>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame({
>>>     'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
>>>     'Color': ['Blue', 'Blue', 'Blue', 'Yellow'],
>>>     'Max Speed': [380., 370., 24., 26.]
>>>     })
>>> # Old behavior
>>> old1 = dict(list(df.groupby(['Animal', 'Color'])))
>>> old2 = dict(list(df.groupby(['Animal'])))
>>> old3 = dict(list(df.groupby('Animal')))
>>> new1 = dict(list(pandas_fixed_groupby(df, ['Animal', 'Color'])))
>>> new2 = dict(list(pandas_fixed_groupby(df, ['Animal'])))
>>> new3 = dict(list(pandas_fixed_groupby(df, 'Animal')))
>>> assert sorted(new1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(old1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(new3.keys())[0] == 'Falcon'
>>> assert sorted(old3.keys())[0] == 'Falcon'
>>> # This is the case that is fixed.
>>> assert sorted(new2.keys())[0] == ('Falcon',)
>>> import numpy as np
>>> if np.lib.NumpyVersion(pd.__version__) < '2.0.0':
>>>     assert sorted(old2.keys())[0] == 'Falcon'