geowatch.utils.util_pandas module¶
- class geowatch.utils.util_pandas.DataFrame(data=None, index: Axes | None = None, columns: Axes | None = None, dtype: Dtype | None = None, copy: bool | None = None)[source]¶
Bases:
DataFrame
Extension of pandas dataframes with quality-of-life improvements.
- References:
Example
from geowatch.utils.util_pandas import *  # NOQA
from geowatch.utils import util_pandas
df = util_pandas.DataFrame.random()
- classmethod random(rows=10, columns='abcde', rng=None)[source]¶
Create a random data frame for testing.
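Example
A minimal usage sketch (not an upstream doctest; it assumes each character of a string columns argument becomes one column name, as the 'abcde' default suggests):
>>> from geowatch.utils import util_pandas
>>> # assumption: a string like 'xy' expands to columns ['x', 'y']
>>> df = util_pandas.DataFrame.random(rows=3, columns='xy')
>>> assert df.shape == (3, 2)
>>> assert list(df.columns) == ['x', 'y']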
- classmethod coerce(data)[source]¶
Ensures that the input is an instance of our extended DataFrame.
Pandas is generally good about input coercion via its normal constructors. The purpose of this classmethod is to quickly ensure that a DataFrame has all of the extended methods defined by this class without incurring a copy. In this sense it is more similar to numpy.asarray().
- Parameters:
data (DataFrame | ndarray | Iterable | dict) – generally another dataframe, otherwise normal inputs that would be given to the regular pandas dataframe constructor
- Return type:
DataFrame
Example
>>> # xdoctest: +REQUIRES(--benchmark)
>>> # This example demonstrates the speed difference between
>>> # recasting as a DataFrame versus using coerce
>>> from geowatch.utils.util_pandas import DataFrame
>>> data = DataFrame.random(rows=10_000)
>>> import timerit
>>> ti = timerit.Timerit(100, bestof=10, verbose=2)
>>> for timer in ti.reset('constructor'):
>>>     with timer:
>>>         DataFrame(data)
>>> for timer in ti.reset('coerce'):
>>>     with timer:
>>>         DataFrame.coerce(data)
>>> # xdoctest: +IGNORE_WANT
Timed constructor for: 100 loops, best of 10
    time per loop: best=2.594 µs, mean=2.783 ± 0.1 µs
Timed coerce for: 100 loops, best of 10
    time per loop: best=246.000 ns, mean=283.000 ± 32.4 ns
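A simpler usage sketch (a hedged illustration, not an upstream doctest; it relies only on the documented contract that coerce returns an instance of this extended class):
>>> import pandas as pd
>>> from geowatch.utils.util_pandas import DataFrame
>>> raw = pd.DataFrame({'a': [1, 2, 3]})
>>> # coerce should hand back the extended DataFrame type without copying the data
>>> df = DataFrame.coerce(raw)
>>> assert isinstance(df, DataFrame)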
- safe_drop(labels, axis=0)[source]¶
Like self.drop(), but does not error if the specified labels do not exist.
- Parameters:
labels (List) – row or column labels to drop, if they are present
axis (int) – drop labels from the index (0) or the columns (1)
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> self = DataFrame({k: np.random.rand(10) for k in 'abcde'})
>>> self.safe_drop(list('bdf'), axis=1)
- reorder(head=None, tail=None, axis=0, missing='error', fill_value=nan, **kwargs)[source]¶
Change the order of the row or column index. Unspecified labels will keep their existing order after the specified labels.
- Parameters:
head (List | None) – The order of the labels to put at the start of the re-indexed data frame. Unspecified labels keep their relative order and are placed after these “head” labels.
tail (List | None) – The order of the labels to put at the end of the re-indexed data frame. Unspecified labels keep their relative order and are placed before these “tail” labels.
axis (int) – The axis to reorder: 0 for rows, 1 for columns.
missing (str) – Policy to handle specified labels that do not exist in the specified axis. Can be either “error”, “drop”, or “fill”. If “drop”, then drop any specified labels that do not exist. If “error”, then raise an error if non-existing labels are given. If “fill”, then fill in values for labels that do not exist.
fill_value (Any) – fill value to use when missing is “fill”.
- Returns:
Self - DataFrame with modified indexes
Example
>>> from geowatch.utils import util_pandas
>>> self = util_pandas.DataFrame.random(rows=5, columns=['a', 'b', 'c', 'd', 'e', 'f'])
>>> new = self.reorder(['b', 'c'], axis=1)
>>> assert list(new.columns) == ['b', 'c', 'a', 'd', 'e', 'f']
>>> # Set the order of the first and last of the columns
>>> new = self.reorder(head=['b', 'c'], tail=['e', 'd'], axis=1)
>>> assert list(new.columns) == ['b', 'c', 'a', 'f', 'e', 'd']
>>> # Test reordering the rows
>>> new = self.reorder([1, 0], axis=0)
>>> assert list(new.index) == [1, 0, 2, 3, 4]
>>> # Test reordering with a non-existent column
>>> new = self.reorder(['q'], axis=1, missing='drop')
>>> assert list(new.columns) == ['a', 'b', 'c', 'd', 'e', 'f']
>>> new = self.reorder(['q'], axis=1, missing='fill')
>>> assert list(new.columns) == ['q', 'a', 'b', 'c', 'd', 'e', 'f']
>>> import pytest
>>> with pytest.raises(ValueError):
>>>     self.reorder(['q'], axis=1, missing='error')
>>> # Should error if column is given in both head and tail
>>> with pytest.raises(ValueError):
>>>     self.reorder(['c'], ['c'], axis=1, missing='error')
- groupby(by=None, **kwargs)[source]¶
Fixed groupby behavior so length-one list arguments are handled correctly: grouping by ['Animal'] yields tuple keys such as ('Falcon',), consistent with multi-key grouping.
- Parameters:
by (str | List[str]) – label or list of labels to group by
**kwargs – groupby kwargs
Example
>>> from geowatch.utils import util_pandas
>>> df = util_pandas.DataFrame({
>>>     'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
>>>     'Color': ['Blue', 'Blue', 'Blue', 'Yellow'],
>>>     'Max Speed': [380., 370., 24., 26.]
>>> })
>>> new1 = dict(list(df.groupby(['Animal', 'Color'])))
>>> new2 = dict(list(df.groupby(['Animal'])))
>>> new3 = dict(list(df.groupby('Animal')))
>>> assert sorted(new1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(new3.keys())[0] == 'Falcon'
>>> # This is the case that is fixed.
>>> assert sorted(new2.keys())[0] == ('Falcon',)
- varied_values(**kwargs)[source]¶
- Kwargs:
min_variations=0, max_variations=None, default=ub.NoParam, dropna=False, on_error='raise'
- SeeAlso:
varied_value_counts()
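Example
No doctest is given upstream; a hedged sketch follows, under the assumption that the method reports the values of columns that take more than one value (the exact return structure is an assumption):
>>> from geowatch.utils import util_pandas
>>> df = util_pandas.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 3]})
>>> # assumption: a constant column like 'a' is filtered out at min_variations=2
>>> print(df.varied_values(min_variations=2))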
- varied_value_counts(**kwargs)[source]¶
- Kwargs:
min_variations=0, max_variations=None, default=ub.NoParam, dropna=False, on_error='raise'
- SeeAlso:
varied_values()
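Example
Likewise undocumented upstream; a hedged sketch under the assumption that this variant reports per-value counts rather than value sets:
>>> from geowatch.utils import util_pandas
>>> df = util_pandas.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 2]})
>>> # assumption: returns counts for each value of the columns that vary
>>> print(df.varied_value_counts(min_variations=2))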
- shorten_columns(return_mapping=False, min_length=0)[source]¶
Shorten column names by separating unique suffixes based on the “.” separator.
- Parameters:
return_mapping (bool) – if True, also return the mapping from old column names to new column names.
min_length (int) – minimum size of the new column names in terms of parts.
- Returns:
Either the new data frame with shortened column names or that data frame and the mapping from old column names to new column names.
- Return type:
DataFrame | Tuple[DataFrame, Dict[str, str]]
Example
>>> from geowatch.utils.util_pandas import DataFrame
>>> # If all suffixes are unique, then they are used.
>>> self = DataFrame.random(columns=['id', 'params.metrics.f1', 'params.metrics.acc', 'params.fit.model.lr', 'params.fit.data.seed'])
>>> new = self.shorten_columns()
>>> assert list(new.columns) == ['id', 'f1', 'acc', 'lr', 'seed']
>>> # Conflicting suffixes impose limitations on what can be shortened
>>> self = DataFrame.random(columns=['id', 'params.metrics.magic', 'params.metrics.acc', 'params.fit.model.lr', 'params.fit.data.magic'])
>>> new = self.shorten_columns()
>>> assert list(new.columns) == ['id', 'metrics.magic', 'metrics.acc', 'model.lr', 'data.magic']
- argextrema(columns, objective='maximize', k=1)[source]¶
Finds the top K indexes (locs) for given columns.
- Parameters:
columns (str | List[str]) – columns to find extrema of. If multiple are given, then secondary columns are used as tiebreakers.
objective (str | List[str]) – Either maximize or minimize (max and min are also accepted). If given as a list, it specifies the criteria for each column, which allows for a mix of maximization and minimization.
k – number of top entries
- Returns:
indexes into a subset of the data that are in the top k for any of the requested columns.
- Return type:
Example
>>> from geowatch.utils.util_pandas import DataFrame
>>> # Find the top f1 rows, using loss as a tiebreaker.
>>> self = DataFrame.random(columns=['id', 'f1', 'loss'], rows=10)
>>> self.loc[3, 'f1'] = 1.0
>>> self.loc[4, 'f1'] = 1.0
>>> self.loc[5, 'f1'] = 1.0
>>> self.loc[3, 'loss'] = 0.2
>>> self.loc[4, 'loss'] = 0.3
>>> self.loc[5, 'loss'] = 0.1
>>> columns = ['f1', 'loss']
>>> k = 4
>>> top_indexes = self.argextrema(columns=columns, k=k, objective=['max', 'min'])
>>> assert len(top_indexes) == k
>>> print(self.loc[top_indexes])
- geowatch.utils.util_pandas.pandas_reorder_columns(df, columns)[source]¶
DEPRECATED: Use DataFrame.reorder() instead.
- geowatch.utils.util_pandas.pandas_argmaxima(data, columns, k=1)[source]¶
Finds the top K indexes for given columns.
- Parameters:
data – pandas data frame
columns – columns to maximize. If multiple are given, then secondary columns are used as tiebreakers.
k – number of top entries
- Returns:
indexes into a subset of the data that are in the top k for any of the requested columns.
- Return type:
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({k: np.random.rand(10) for k in 'abcde'})
>>> columns = ['b', 'd', 'e']
>>> k = 1
>>> top_indexes = pandas_argmaxima(data=data, columns=columns, k=k)
>>> assert len(top_indexes) == k
>>> print(data.loc[top_indexes])
- geowatch.utils.util_pandas.pandas_suffix_columns(data, suffixes)[source]¶
Return the columns of data that end with any of the given suffixes.
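Example
A hedged sketch (the exact matching rule is an assumption based on the summary line):
>>> from geowatch.utils.util_pandas import pandas_suffix_columns
>>> import pandas as pd
>>> data = pd.DataFrame(columns=['meta.id', 'eval.f1', 'fit.f1'])
>>> # assumption: returns the column names ending in any given suffix
>>> print(pandas_suffix_columns(data, ['.f1']))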
- geowatch.utils.util_pandas.pandas_shorten_columns(summary_table, return_mapping=False, min_length=0)[source]¶
Shorten column names.
DEPRECATED: Use DataFrame.shorten_columns() instead.
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame([
>>>     {'param_hashid': 'badbeaf', 'metrics.eval.f1': 0.9, 'metrics.eval.mcc': 0.8, 'metrics.eval.acc': 0.3},
>>>     {'param_hashid': 'decaf', 'metrics.eval.f1': 0.6, 'metrics.eval.mcc': 0.2, 'metrics.eval.acc': 0.4},
>>>     {'param_hashid': 'feedcode', 'metrics.eval.f1': 0.5, 'metrics.eval.mcc': 0.3, 'metrics.eval.acc': 0.1},
>>> ])
>>> print(df.to_string(index=0))
>>> df2 = pandas_shorten_columns(df)
param_hashid  metrics.eval.f1  metrics.eval.mcc  metrics.eval.acc
     badbeaf              0.9               0.8               0.3
       decaf              0.6               0.2               0.4
    feedcode              0.5               0.3               0.1
>>> print(df2.to_string(index=0))
param_hashid   f1  mcc  acc
     badbeaf  0.9  0.8  0.3
       decaf  0.6  0.2  0.4
    feedcode  0.5  0.3  0.1
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame([
>>>     {'param_hashid': 'badbeaf', 'metrics.eval.f1.mean': 0.9, 'metrics.eval.f1.std': 0.8},
>>>     {'param_hashid': 'decaf', 'metrics.eval.f1.mean': 0.6, 'metrics.eval.f1.std': 0.2},
>>>     {'param_hashid': 'feedcode', 'metrics.eval.f1.mean': 0.5, 'metrics.eval.f1.std': 0.3},
>>> ])
>>> df2 = pandas_shorten_columns(df, min_length=2)
>>> print(df2.to_string(index=0))
param_hashid  f1.mean  f1.std
     badbeaf      0.9     0.8
       decaf      0.6     0.2
    feedcode      0.5     0.3
- geowatch.utils.util_pandas.pandas_condense_paths(colvals)[source]¶
Condense a column of paths to keep only the shortest distinguishing suffixes
- Parameters:
colvals (pd.Series) – a column containing paths to condense
- Returns:
the condensed series and a mapping from old to new
- Return type:
Tuple
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> rows = [
>>>     {'path1': '/path/to/a/file1'},
>>>     {'path1': '/path/to/a/file2'},
>>> ]
>>> colvals = pd.DataFrame(rows)['path1']
>>> pandas_condense_paths(colvals)
- geowatch.utils.util_pandas.pandas_truncate_items(data, paths=False, max_length=16)[source]¶
Truncate overly long string values in a data frame, returning the modified frame and a mapping from the old values to the new ones.
- Parameters:
data (pd.DataFrame) – data frame to truncate
- Returns:
Tuple[pd.DataFrame, Dict[str, str]]
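Example
A hedged sketch grounded only in the documented return type (the exact truncated form of each value is an assumption):
>>> from geowatch.utils.util_pandas import pandas_truncate_items
>>> import pandas as pd
>>> data = pd.DataFrame({'col': ['a' * 100, 'b' * 100]})
>>> new_data, mapping = pandas_truncate_items(data, max_length=16)
>>> # mapping is assumed to relate original strings to their truncated forms
>>> print(new_data)
>>> print(mapping)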
- class geowatch.utils.util_pandas.DotDictDataFrame(*args, **kw)[source]¶
Bases:
DataFrame
A proof-of-concept wrapper around pandas that lets us walk down the nested structure a little easier.
The API is a bit weird, and the caches are not invalidated if any column changes, but it does a reasonable job otherwise.
Is there another library out there that does this?
- SeeAlso:
DotDict
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> rows = [
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>>     {'node1.id': 1, 'node2.id': 2, 'node1.metrics.ap': 0.5, 'node2.metrics.ap': 0.8},
>>> ]
>>> self = DotDictDataFrame(rows)
>>> # Test prefix lookup
>>> assert set(self['node1'].columns) == {'node1.id', 'node1.metrics.ap'}
>>> # Test suffix lookup
>>> assert set(self['id'].columns) == {'node1.id', 'node2.id'}
>>> # Test mid-node lookup
>>> assert set(self['metrics'].columns) == {'node1.metrics.ap', 'node2.metrics.ap'}
>>> # Test single lookup
>>> assert set(self[['node1.id']].columns) == {'node1.id'}
>>> # Test glob
>>> assert set(self.find_columns('*metri*')) == {'node1.metrics.ap', 'node2.metrics.ap'}
- property nested_columns¶
- geowatch.utils.util_pandas.aggregate_columns(df, aggregator=None, fallback='const', nonconst_policy='error')[source]¶
Aggregates parameter columns based on per-column strategies / functions specified in aggregator.
- Parameters:
df (pd.DataFrame) – the data frame whose columns should be aggregated.
aggregator (Dict[str, str | callable]) – a dictionary mapping column names to a callable function that should be used to aggregate them. There are special string codes that we accept as well. Special functions include: hist, hash, min-max, and const.
fallback (str | callable) – Aggregator function for any column without an explicit aggregator. Defaults to “const”, which passes one value from the column through if its values are constant. If they are not constant, the nonconst policy is triggered.
nonconst_policy (str) – Behavior when the aggregator is “const”, but the input is non-constant. The policies are:
‘error’ - error if unhandled non-uniform columns exist
‘drop’ - remove unhandled non-uniform columns
‘hash’ - hash unhandled non-uniform columns (exercised in the example below)
- Returns:
pd.Series
Todo
[ ] optimize this
CommandLine
xdoctest -m geowatch.utils.util_pandas aggregate_columns
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'nums1': np.random.rand(num_rows),
>>>     'nums2': np.random.rand(num_rows),
>>>     'nums3': (np.random.rand(num_rows) * 10).astype(int),
>>>     'nums4': (np.random.rand(num_rows) * 10).astype(int),
>>>     'cats1': np.random.randint(0, 3, num_rows),
>>>     'cats2': np.random.randint(0, 3, num_rows),
>>>     'cats3': np.random.randint(0, 3, num_rows),
>>>     'const1': ['a'] * num_rows,
>>>     'strs1': [np.random.choice(list('abc')) for _ in range(num_rows)],
>>> }
>>> df = pd.DataFrame(columns)
>>> aggregator = ub.udict({
>>>     'nums1': 'mean',
>>>     'nums2': 'max',
>>>     'nums3': 'min-max',
>>>     'nums4': 'stats',
>>>     'cats1': 'histogram',
>>>     'cats3': 'first',
>>>     'cats2': 'hash12',
>>>     'strs1': 'hash12',
>>> })
>>> #
>>> # Test that the const fallback works
>>> row = aggregate_columns(df, aggregator, fallback='const')
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))
>>> assert row['const1'] == 'a'
>>> row = aggregate_columns(df.iloc[0:1], aggregator, fallback='const')
>>> assert row['const1'] == 'a'
>>> #
>>> # Test that the drop fallback works
>>> row = aggregate_columns(df, aggregator, fallback='drop')
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))
>>> assert 'const1' not in row
>>> row = aggregate_columns(df.iloc[0:1], aggregator, fallback='drop')
>>> assert 'const1' not in row
>>> #
>>> # Test that non-constant policy triggers
>>> aggregator_ = aggregator - {'cats3'}
>>> import pytest
>>> with pytest.raises(NonConstantError):
>>>     row = aggregate_columns(df, aggregator_, nonconst_policy='error')
>>> row = aggregate_columns(df, aggregator_, nonconst_policy='drop')
>>> assert 'cats3' not in row
>>> row = aggregate_columns(df, aggregator_, nonconst_policy='hash')
>>> assert 'cats3' in row
>>> #
>>> # Test an empty dataframe returns an empty series
>>> row = aggregate_columns(df.iloc[0:0], aggregator)
>>> assert len(row) == 0
>>> #
>>> # Test single column cases work fine.
>>> for col in df.columns:
...     subdf = df[[col]]
...     subagg = aggregate_columns(subdf, aggregator, fallback='const')
...     assert len(subagg) == 1
>>> #
>>> # Test single column drop case works
>>> subagg = aggregate_columns(df[['cats3']], aggregator_, fallback='const', nonconst_policy='drop')
>>> assert len(subagg) == 0
>>> subagg = aggregate_columns(df[['cats3']], aggregator_, fallback='drop')
>>> assert len(subagg) == 0
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'dates': ['2101-01-01', '1970-01-01', '2000-01-01'],
>>>     'lists': [['a'], ['a', 'b'], []],
>>>     'nums': [1, 2, 3],
>>> }
>>> df = pd.DataFrame(columns)
>>> aggregator = ub.udict({
>>>     'dates': 'min-max',
>>>     'lists': 'hash',
>>>     'nums': 'mean',
>>> })
>>> row = aggregate_columns(df, aggregator)
>>> print('row = {}'.format(ub.urepr(row.to_dict(), nl=1)))
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> import numpy as np
>>> num_rows = 10
>>> columns = {
>>>     'items': [['a'], ['bcd', 'ef'], [], ['3', '234', '2343']],
>>> }
>>> df = pd.DataFrame(columns)
>>> row = aggregate_columns(df, 'last', fallback='const')
>>> columns = {
>>>     'items': ['a', 'c', 'c', 'd'],
>>>     'items2': [['a'], ['bcd', 'ef'], [], ['3', '234', '2343']],
>>> }
>>> df = pd.DataFrame(columns)
>>> row = aggregate_columns(df, 'unique')
- class geowatch.utils.util_pandas.SpecialAggregators[source]¶
Bases:
object
- special_lut = {'first': <function SpecialAggregators.<lambda>>, 'hash': <function SpecialAggregators.hash>, 'hash12': <function SpecialAggregators.hash12>, 'hist': <function dict_hist>, 'histogram': <function dict_hist>, 'last': <function SpecialAggregators.<lambda>>, 'min_max': <function SpecialAggregators.min_max>, 'stats': <function stats_dict>, 'unique': <function SpecialAggregators.unique>}¶
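The table above is the authoritative set of string codes that aggregate_columns resolves; a small illustration of inspecting it (derived directly from the lookup table shown above):
>>> from geowatch.utils.util_pandas import SpecialAggregators
>>> print(sorted(SpecialAggregators.special_lut.keys()))
['first', 'hash', 'hash12', 'hist', 'histogram', 'last', 'min_max', 'stats', 'unique']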
- exception geowatch.utils.util_pandas.NonConstantError[source]¶
Bases:
ValueError
- class geowatch.utils.util_pandas.GroupbyFutureWrapper[source]¶
Bases:
ObjectProxy
Wraps a groupby object to get the new behavior sooner.
- geowatch.utils.util_pandas.pandas_fixed_groupby(df, by=None, **kwargs)[source]¶
Fixed groupby behavior so length-one list arguments are handled correctly: grouping by ['Animal'] yields tuple keys such as ('Falcon',), consistent with multi-key grouping.
- Parameters:
df (DataFrame) – the data frame to group
by (str | List[str]) – label or list of labels to group by
**kwargs – groupby kwargs
Example
>>> from geowatch.utils.util_pandas import *  # NOQA
>>> df = pd.DataFrame({
>>>     'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
>>>     'Color': ['Blue', 'Blue', 'Blue', 'Yellow'],
>>>     'Max Speed': [380., 370., 24., 26.]
>>> })
>>> # Old behavior
>>> old1 = dict(list(df.groupby(['Animal', 'Color'])))
>>> old2 = dict(list(df.groupby(['Animal'])))
>>> old3 = dict(list(df.groupby('Animal')))
>>> new1 = dict(list(pandas_fixed_groupby(df, ['Animal', 'Color'])))
>>> new2 = dict(list(pandas_fixed_groupby(df, ['Animal'])))
>>> new3 = dict(list(pandas_fixed_groupby(df, 'Animal')))
>>> assert sorted(new1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(old1.keys())[0] == ('Falcon', 'Blue')
>>> assert sorted(new3.keys())[0] == 'Falcon'
>>> assert sorted(old3.keys())[0] == 'Falcon'
>>> # This is the case that is fixed.
>>> assert sorted(new2.keys())[0] == ('Falcon',)
>>> import numpy as np
>>> if np.lib.NumpyVersion(pd.__version__) < '2.0.0':
>>>     assert sorted(old2.keys())[0] == 'Falcon'