geowatch.utils.result_analysis module

This utility provides a way to define a table of hyperparameter key/values and associated metric key/values. Given this table, along with information about whether each metric is better when higher or lower, the ResultAnalysis class uses several statistical methods to estimate parameter importance.

Example

>>> # Given a list of experiments, configs, and results
>>> from geowatch.utils.result_analysis import ResultAnalysis, Result
>>> # Given a table of experiments with parameters, and metrics
>>> table = [
>>>     Result('expt0', {'param1': 2, 'param2': 'b'}, {'f1': 0.75, 'loss': 0.5}),
>>>     Result('expt1', {'param1': 0, 'param2': 'c'}, {'f1': 0.92, 'loss': 0.4}),
>>>     Result('expt2', {'param1': 1, 'param2': 'b'}, {'f1': 0.77, 'loss': 0.3}),
>>>     Result('expt3', {'param1': 1, 'param2': 'a'}, {'f1': 0.67, 'loss': 0.2}),
>>> ]
>>> # Create a ResultAnalysis object and tell it what metrics should be maximized / minimized
>>> analysis = ResultAnalysis(table, metric_objectives={'f1': 'max', 'loss': 'min'})
>>> # An overall analysis can be obtained as follows
>>> analysis.analysis()  # xdoctest: +IGNORE_WANT
PARAMETER: param2 - METRIC: f1
==============================
f1      count  mean       std   min    25%   50%    75%   max
param2
c         1.0  0.92       NaN  0.92  0.920  0.92  0.920  0.92
b         2.0  0.76  0.014142  0.75  0.755  0.76  0.765  0.77
a         1.0  0.67       NaN  0.67  0.670  0.67  0.670  0.67
...
ANOVA: If p is low, the param 'param2' might have an effect
  Rank-ANOVA: p=0.25924026
  Mean-ANOVA: p=0.07823610
...
Pairwise T-Tests
  If p is low, param2=c may outperform param2=b.
    ttest_ind:  p=nan
  If p is low, param2=b may outperform param2=a.
    ttest_ind:  p=nan
    ttest_rel:  p=nan, n_pairs=1
...
PARAMETER: param1 - METRIC: loss
================================
loss    count  mean       std  min    25%   50%    75%  max
param1
1         2.0  0.25  0.070711  0.2  0.225  0.25  0.275  0.3
0         1.0  0.40       NaN  0.4  0.400  0.40  0.400  0.4
2         1.0  0.50       NaN  0.5  0.500  0.50  0.500  0.5
...
ANOVA: If p is low, the param 'param1' might have an effect
  Rank-ANOVA: p=0.25924026
  Mean-ANOVA: p=0.31622777
...
Pairwise T-Tests
  If p is low, param1=1 may outperform 0.
    ttest_ind:  p=nan
  If p is low, param1=0 may outperform 2.
    ttest_ind:  p=nan
  param_name metric  anova_rank_H  anova_rank_p  anova_mean_F  anova_mean_p
0     param2     f1           2.7       0.25924       81.1875      0.078236
3     param1   loss           2.7       0.25924        4.5000      0.316228
1     param2   loss           1.8       0.40657        0.7500      0.632456
2     param1     f1           1.8       0.40657        2.7675      0.391181
>>> # But specific parameters or groups of parameters can be inspected
>>> # individually
>>> analysis.build()
>>> analysis.abalate(['param1'], metrics=['f1'])  # xdoctest: +IGNORE_WANT
skillboard.ratings = {
    (0,): Rating(mu=25, sigma=8.333333333333334),
    (1,): Rating(mu=27.63523138347365, sigma=8.065506316323548),
    (2,): Rating(mu=22.36476861652635, sigma=8.065506316323548),
}
win_probs = {
    (0,): 0.3333333333333333,
    (1,): 0.3445959888771101,
    (2,): 0.32207067778955656,
}
...
When config(param1=1) is better than config(param1=2), the improvement in f1 is
   count  mean  std   min   25%   50%   75%   max
0    1.0  0.02  NaN  0.02  0.02  0.02  0.02  0.02
...
When config(param1=2) is better than config(param1=1), the improvement in f1 is
   count  mean  std  min  25%  50%  75%  max
0    0.0   NaN  NaN  NaN  NaN  NaN  NaN  NaN

Example

>>> # Simple example for computing a p-values between a set of baseline
>>> # results and hypothesis you think might do better.
>>> # Given a list of experiments, configs, and results
>>> from geowatch.utils.result_analysis import ResultAnalysis, Result
>>> # Given a table of experiments with parameters, and metrics
>>> table = [
>>>     Result('expt0', {'group': 'baseline'}, {'f1': 0.75}),
>>>     Result('expt1', {'group': 'baseline'}, {'f1': 0.72}),
>>>     Result('expt2', {'group': 'baseline'}, {'f1': 0.79}),
>>>     Result('expt3', {'group': 'baseline'}, {'f1': 0.73}),
>>>     Result('expt4', {'group': 'baseline'}, {'f1': 0.74}),
>>>     Result('expt5', {'group': 'baseline'}, {'f1': 0.74}),
>>>     Result('expt5', {'group': 'hypothesis'}, {'f1': 0.76}),
>>>     Result('expt6', {'group': 'hypothesis'}, {'f1': 0.78}),
>>>     Result('expt7', {'group': 'hypothesis'}, {'f1': 0.77}),
>>>     Result('expt8', {'group': 'hypothesis'}, {'f1': 0.75}),
>>> ]
>>> # Create a ResultAnalysis object and tell it what metrics should be maximized / minimized
>>> analysis = ResultAnalysis(table, metric_objectives={'f1': 'max'})
>>> # An overall analysis can be obtained as follows
>>> analysis.analysis()

This seems related to [RijnHutter2018]. Need to look more closely to determine its exact relation and what we can learn from it (or what we do better / worse). Also see followup [Probst2019].

Requires:

pip install ray
pip install openskill

class geowatch.utils.result_analysis.Result(name, params, metrics, meta=None)[source]

Bases: NiceRepr

Storage of names, parameters, and quality metrics for a single experiment.

Variables:
  • name (str | None) – Name of the experiment. Optional; this is unused in the analysis (names are never used computationally, so they are safe to use as keys).

  • params (Dict[str, object]) – configuration of the experiment. This is a dictionary mapping a parameter name to its value.

  • metrics (Dict[str, float]) – quantitative results of the experiment. This is a dictionary mapping each quality metric computed on this result to its value.

  • meta (Dict | None) – any other metadata about this result. This is unused in the analysis.

Example

>>> self = Result.demo(rng=32)
>>> print('self = {}'.format(self))
self = <Result(name=53f57161,f1=0.33,acc=0.75,param1=1,param2=6.67,param3=a)>

Example

>>> self = Result.demo(mode='alt', rng=32)
>>> print('self = {}'.format(self))
to_dict()[source]
classmethod demo(mode='null', rng=None)[source]
class geowatch.utils.result_analysis.ResultTable(params, metrics)[source]

Bases: object

An object that stores two tables of corresponding metrics and parameters.

Helps abstract away the old Result object.

Example

>>> from geowatch.utils.result_analysis import *  # NOQA
>>> self = ResultTable.demo()
>>> print(self.table)
property table
property result_list
classmethod demo(num=10, mode='null', rng=None)[source]
classmethod coerce(data, param_cols=None, metric_cols=None)[source]
property varied
class geowatch.utils.result_analysis.ResultAnalysis(results, metrics=None, params=None, ignore_params=None, ignore_metrics=None, metric_objectives=None, abalation_orders={1}, default_objective='max', p_threshold=0.05)[source]

Bases: NiceRepr

Groups and runs stats on results

Runs statistical tests on sets of configuration-metrics pairs

Variables:
  • results (List[Result] | DataFrame) – list of results, or something coercible to one.

  • ignore_metrics (Set[str]) – metrics to ignore

  • ignore_params (Set[str]) – parameters to ignore

  • metric_objectives (Dict[str, str]) – indicate whether each metric should be maximized (“max”) or minimized (“min”)

  • metrics (List[str]) – only consider these metrics

  • params (List[str]) – if given, only consider these params

  • abalation_orders (Set[int]) – The number of parameters to be held constant in each statistical grouping. Defaults to 1, so it groups together results where 1 variable is held constant. Including 2 will include pairwise settings of parameters to be held constant. Using -1 or -2 means all but 1 or 2 parameters will be held constant, respectively.

  • default_objective (str) – assume max or min for unknown metrics
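A hypothetical sketch (not part of the API) of how the abalation_orders values might expand into parameter groupings, under the assumption that each order k selects every size-k combination of the varied parameters and that a negative order means "all but that many" parameters:

```python
import itertools


def expand_ablation_orders(param_names, orders={1}):
    # For each requested order k, emit every size-k combination of
    # parameters; a negative k is interpreted as len(param_names) + k,
    # i.e. "all but |k|" parameters.
    n = len(param_names)
    groups = []
    for k in sorted(orders):
        size = k if k > 0 else n + k
        groups.extend(itertools.combinations(param_names, size))
    return groups
```

For example, with three parameters, order 1 yields the three singleton groupings, while order -1 yields the three pairs.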

Example

>>> self = ResultAnalysis.demo()
>>> self.analysis()

Example

>>> self = ResultAnalysis.demo(num=5000, mode='alt')
>>> self.analysis()

Example

>>> # Given a list of experiments, configs, and results
>>> # Create a ResultAnalysis object
>>> from geowatch.utils.result_analysis import *  # NOQA
>>> result_table = ResultTable.coerce([
>>>     Result('expt0', {'param1': 2, 'param3': 'b'}, {'f1': 0.75}),
>>>     Result('expt1', {'param1': 0, 'param3': 'c'}, {'f1': 0.92}),
>>>     Result('expt2', {'param1': 1, 'param3': 'b'}, {'f1': 0.77}),
>>>     Result('expt3', {'param1': 1, 'param3': 'a'}, {'f1': 0.67}),
>>>     Result('expt4', {'param1': 0, 'param3': 'c'}, {'f1': 0.98}),
>>>     Result('expt5', {'param1': 2, 'param3': 'a'}, {'f1': 0.86}),
>>>     Result('expt6', {'param1': 1, 'param3': 'c'}, {'f1': 0.77}),
>>>     Result('expt7', {'param1': 1, 'param3': 'c'}, {'f1': 0.41}),
>>>     Result('expt8', {'param1': 1, 'param3': 'a'}, {'f1': 0.64}),
>>>     Result('expt9', {'param1': 0, 'param3': 'b'}, {'f1': 0.95}),
>>> ])
>>> analysis = ResultAnalysis(result_table)
>>> # Calling the analysis method prints something like the following
>>> analysis.analysis()

PARAMETER 'param1' - f1

f1       mean       std   max   min  num  best
param1
0       0.950  0.030000  0.98  0.92  3.0  0.98
2       0.805  0.077782  0.86  0.75  2.0  0.86
1       0.652  0.147377  0.77  0.41  5.0  0.77

ANOVA hypothesis (roughly): the param 'param1' has no effect on the metric
  Reject this hypothesis if the p value is less than a threshold
  Rank-ANOVA: p=0.0397
  Mean-ANOVA: p=0.0277

Pairwise T-Tests
  Is param1=0 about as good as param1=2?
    ttest_ind:  p=0.2058
  Is param1=1 about as good as param1=2?
    ttest_ind:  p=0.1508

PARAMETER 'param3' - f1

f1          mean       std   max   min  num  best
param3
c       0.770000  0.255734  0.98  0.41  4.0  0.98
b       0.823333  0.110151  0.95  0.75  3.0  0.95
a       0.723333  0.119304  0.86  0.64  3.0  0.86

ANOVA hypothesis (roughly): the param 'param3' has no effect on the metric
  Reject this hypothesis if the p value is less than a threshold
  Rank-ANOVA: p=0.5890
  Mean-ANOVA: p=0.8145

Pairwise T-Tests
  Is param3=b about as good as param3=c?
    ttest_ind:  p=0.7266
  Is param3=a about as good as param3=b?
    ttest_ind:  p=0.3466
    ttest_rel:  p=0.3466
  Is param3=a about as good as param3=c?
    ttest_ind:  p=0.7626
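For reference, the ANOVA figures in this example can be approximated directly with scipy, under the assumption that Rank-ANOVA corresponds to the Kruskal-Wallis H-test (scipy.stats.kruskal) and Mean-ANOVA to the one-way F-test (scipy.stats.f_oneway):

```python
from scipy import stats

# f1 scores from the example table above, grouped by param1
groups = {
    0: [0.92, 0.98, 0.95],
    1: [0.77, 0.67, 0.77, 0.41, 0.64],
    2: [0.75, 0.86],
}
H, rank_p = stats.kruskal(*groups.values())   # Rank-ANOVA
F, mean_p = stats.f_oneway(*groups.values())  # Mean-ANOVA
print(f'Rank-ANOVA: p={rank_p:.4f}')
print(f'Mean-ANOVA: p={mean_p:.4f}')
```

Both p values fall below 0.05 for param1, consistent with the conclusion that param1 likely has an effect on f1.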

classmethod demo(num=10, mode='null', rng=None)[source]
run()[source]
analysis()[source]
property table
metric_table()[source]
property varied
abaltion_groups(param_group, k=2)[source]

Return groups where the specified parameter(s) are varied, but all other non-ignored parameters are held the same.

Parameters:
  • param_group (str | List[str]) – One or more parameters that are allowed to vary

  • k (int) – minimum number of items a group must contain to be returned

Returns:

a list of subsets of the table where all but the specified (non-ignored) parameters are held constant.

Return type:

List[DataFrame]

Example

>>> self = ResultAnalysis.demo()
>>> param = 'param2'
>>> self.abaltion_groups(param)
tune()[source]
Look into:

# Old bayes opt? https://github.com/Erotemic/clab/blob/master/clab/live/urban_pred.py#L459

Example

>>> self = ResultAnalysis.demo(100)
ablate(param_group, metrics=None, use_openskill='auto')[source]

Todo

rectify with test-group

Example

>>> self = ResultAnalysis.demo(100)
>>> param = 'param2'
>>> # xdoctest: +REQUIRES(module:openskill)
>>> self.ablate(param)
>>> self = ResultAnalysis.demo()
>>> param_group = ['param2', 'param3']
>>> # xdoctest: +REQUIRES(module:openskill)
>>> self.ablate(param_group)
abalation_groups(param_group, k=2)

Return groups where the specified parameter(s) are varied, but all other non-ignored parameters are held the same.

Parameters:
  • param_group (str | List[str]) – One or more parameters that are allowed to vary

  • k (int) – minimum number of items a group must contain to be returned

Returns:

a list of subsets of the table where all but the specified (non-ignored) parameters are held constant.

Return type:

List[DataFrame]

Example

>>> self = ResultAnalysis.demo()
>>> param = 'param2'
>>> self.abaltion_groups(param)
abalate(param_group, metrics=None, use_openskill='auto')

Todo

rectify with test-group

Example

>>> self = ResultAnalysis.demo(100)
>>> param = 'param2'
>>> # xdoctest: +REQUIRES(module:openskill)
>>> self.ablate(param)
>>> self = ResultAnalysis.demo()
>>> param_group = ['param2', 'param3']
>>> # xdoctest: +REQUIRES(module:openskill)
>>> self.ablate(param_group)
test_group(param_group, metric_key)[source]

Get stats for a particular metric / constant group

Parameters:
  • param_group (List[str]) – group of parameters to hold constant.

  • metric_key (str) – The metric to test.

Returns:

dict – TODO: document these stats clearly and accurately

Example

>>> self = ResultAnalysis.demo(num=100)
>>> print(self.table)
>>> param_group = ['param2', 'param1']
>>> metric_key = 'f1'
>>> stats_row = self.test_group(param_group, metric_key)
>>> print('stats_row = {}'.format(ub.urepr(stats_row, nl=2, sort=0, precision=2)))
build()[source]
report()[source]
conclusions()[source]
plot(xlabel, metric_key, group_labels, data=None, **kwargs)[source]
Parameters:

group_labels (dict) – Tells seaborn what attributes to use to distinguish curves, e.g. hue, size, marker. Can also contain “col” for use with FacetGrid, and “fig” to separate different configurations into different figures.

Returns:

A list for each figure containing info about that figure for any postprocessing.

Return type:

List[Dict]

Example

>>> self = ResultAnalysis.demo(num=1000, mode='alt')
>>> self.analysis()
>>> print('self = {}'.format(self))
>>> print('self.varied = {}'.format(ub.urepr(self.varied, nl=1)))
>>> # xdoctest: +REQUIRES(--show)
>>> # xdoctest: +REQUIRES(module:kwplot)
>>> import kwplot
>>> kwplot.autosns()
>>> xlabel = 'x'
>>> metric_key = 'acc'
>>> group_labels = {
>>>     'fig': ['u'],
>>>     'col': ['y', 'v'],
>>>     'hue': ['z'],
>>>     'size': [],
>>> }
>>> kwargs = {'xscale': 'log', 'yscale': 'log'}
>>> self.plot(xlabel, metric_key, group_labels, **kwargs)
class geowatch.utils.result_analysis.SkillTracker(player_ids)[source]

Bases: object

Wrapper around openskill

Parameters:

player_ids (List[T]) – a list of ids (usually ints) used to represent each player

Example

>>> # xdoctest: +REQUIRES(module:openskill)
>>> self = SkillTracker([1, 2, 3, 4, 5])
>>> self.observe([2, 3])  # Player 2 beat player 3.
>>> self.observe([1, 2, 5, 3])  # Player 4 didn't play this round.
>>> self.observe([2, 3, 4, 5, 1])  # Everyone played, player 2 won.
>>> win_probs = self.predict_win()
>>> print('win_probs = {}'.format(ub.urepr(win_probs, nl=1, precision=2)))
win_probs = {
    1: 0.20,
    2: 0.21,
    3: 0.19,
    4: 0.20,
    5: 0.20,
}
Requirements:

openskill

predict_win()[source]

Estimate the probability that a particular player will win given the current ratings.

Returns:

mapping from player ids to win probabilities

Return type:

Dict[T, float]
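As a rough illustration (not openskill's exact computation), win probabilities can be derived from Gaussian ratings by computing each pairwise P(i beats j) from the difference of means scaled by the combined uncertainty, averaging over opponents, and normalizing:

```python
import math


def predict_win_sketch(ratings):
    # ratings: {player_id: (mu, sigma)}; returns normalized win probabilities.
    ids = list(ratings)
    raw = {}
    for i in ids:
        mu_i, sig_i = ratings[i]
        pairwise = []
        for j in ids:
            if j == i:
                continue
            mu_j, sig_j = ratings[j]
            # P(i beats j) under a Gaussian skill-difference model
            z = (mu_i - mu_j) / math.sqrt(sig_i ** 2 + sig_j ** 2)
            pairwise.append(0.5 * (1.0 + math.erf(z / math.sqrt(2))))
        raw[i] = sum(pairwise) / len(pairwise)
    total = sum(raw.values())
    return {i: p / total for i, p in raw.items()}
```

With identical ratings every player gets an equal share, matching the near-uniform win_probs shown in the abalate example above.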

observe(ranking)[source]

After simulating a round, pass the ranked order of who won (winner first, loser last) to this function, and it updates the ratings.

Parameters:

ranking (List[T]) – ranking of all the players that played in this round; winners are at the front (0-th place) of the list.

class geowatch.utils.result_analysis.UnhashablePlaceholder[source]

Bases: str

geowatch.utils.result_analysis.varied_values(longform, min_variations=0, max_variations=None, default=NoParam, dropna=False, on_error='raise')[source]

Given a list of dictionaries, find the values that differ between them.

Parameters:
  • longform (List[Dict[KT, VT]] | DataFrame) – This is longform data, as described in [SeabornLongform]. It is a list of dictionaries.

    Each item in the list - or row - is a dictionary and can be thought of as an observation. The keys in each dictionary are the columns. The values of the dictionary must be hashable. Lists will be converted into tuples.

  • min_variations (int, default=0) – “columns” with fewer than min_variations unique values are removed from the result.

  • max_variations (int | None) – If specified only return items with fewer than this number of variations.

  • default (VT | NoParamType) – if specified, unspecified columns are given this value. Defaults to NoParam.

  • on_error (str) – Error policy when trying to add a non-hashable type. Defaults to “raise”. Can be “raise”, “ignore”, or “placeholder”, which will impute a hashable error message.

Returns:

a mapping from each “column” to the set of unique values it took over the “rows”. If a column is missing from some rows and a default is specified, those rows are treated as taking the default value.

Return type:

Dict[KT, List[VT]]

Raises:

KeyError – If default is unspecified and the rows do not all contain the same columns.
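A pure-python sketch of the core behavior (a hypothetical simplification that ignores the default, dropna, and on_error options):

```python
def varied_values_sketch(longform, min_variations=0):
    # Collect every value each column takes across the rows.
    columns = {key for row in longform for key in row}
    seen = {col: set() for col in columns}
    for row in longform:
        for col in columns:
            if col not in row:
                # No default was given, so every row must share the same columns.
                raise KeyError(col)
            value = row[col]
            if isinstance(value, list):
                value = tuple(value)  # lists are converted into hashable tuples
            seen[col].add(value)
    # Drop columns with fewer than min_variations unique values.
    return {col: vals for col, vals in seen.items()
            if len(vals) >= min_variations}
```

For instance, with min_variations=2, a column that takes only one value across all rows is dropped from the result.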

geowatch.utils.result_analysis.varied_value_counts(longform, min_variations=0, max_variations=None, default=NoParam, dropna=False, on_error='raise')[source]

Given a list of dictionaries, find the values that differ between them.

Parameters:
  • longform (List[Dict[KT, VT]] | DataFrame) – This is longform data, as described in [SeabornLongform]. It is a list of dictionaries.

    Each item in the list - or row - is a dictionary and can be thought of as an observation. The keys in each dictionary are the columns. The values of the dictionary must be hashable. Lists will be converted into tuples.

  • min_variations (int) – “columns” with fewer than min_variations unique values are removed from the result. Defaults to 0.

  • max_variations (int | None) – If specified only return items with fewer than this number of variations.

  • default (VT | NoParamType) – if specified, unspecified columns are given this value. Defaults to NoParam.

  • on_error (str) – Error policy when trying to add a non-hashable type. Defaults to “raise”. Can be “raise”, “ignore”, or “placeholder”, which will impute a hashable error message.

Returns:

a mapping from each “column” to the unique values it took over the “rows” and how many times it took each value. If a column is missing from some rows and a default is specified, those rows are treated as taking the default value.

Return type:

Dict[KT, Dict[VT, int]]

Raises:

KeyError – If default is unspecified and the rows do not all contain the same columns.

Example

longform = [
    {'a': 'on', 'b': 'red'},
    {'a': 'on', 'b': 'green'},
    {'a': 'off', 'b': 'blue'},
    {'a': 'off', 'b': 'black'},
]
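A pure-python sketch of the counting behavior on rows like those above (a hypothetical simplification that ignores min_variations, default, and on_error):

```python
from collections import Counter


def varied_value_counts_sketch(longform):
    # Count how many times each column takes each value across the rows.
    counts = {}
    for row in longform:
        for col, value in row.items():
            counts.setdefault(col, Counter())[value] += 1
    return counts


longform = [
    {'a': 'on', 'b': 'red'},
    {'a': 'on', 'b': 'green'},
    {'a': 'off', 'b': 'blue'},
    {'a': 'off', 'b': 'black'},
]
```

Here column 'a' takes each of its two values twice, while column 'b' takes four distinct values once each.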

class geowatch.utils.result_analysis.GroupbyFutureWrapper[source]

Bases: ObjectProxy

Wraps a groupby object to get the new behavior sooner.

geowatch.utils.result_analysis.fix_groupby(groups)[source]
geowatch.utils.result_analysis.aggregate_stats(data, suffix='', group_keys=None)[source]

Given columns interpreted as containing stats, aggregate those stats within each group. For each group, any non-group, non-stat column whose values are consistent across the group's rows is kept as-is; otherwise that column is set to None in the aggregated row.

Parameters:
  • data (DataFrame) – a data frame with columns: ‘mean’, ‘std’, ‘min’, ‘max’, and ‘nobs’ (possibly with a suffix)

  • suffix (str) – if the nobs, std, mean, min, and max have a suffix, specify it

  • group_keys (List[str]) – columns to group rows by when aggregating

Returns:

New dataframe where grouped rows have been aggregated into a single row.

Return type:

DataFrame

Example

>>> import pandas as pd
>>> data = pd.DataFrame([
>>>     #
>>>     {'mean': 8, 'std': 1, 'min': 0, 'max': 1, 'nobs': 2, 'p1': 'a', 'p2': 1},
>>>     {'mean': 6, 'std': 2, 'min': 0, 'max': 1, 'nobs': 3, 'p1': 'a', 'p2': 1},
>>>     {'mean': 7, 'std': 3, 'min': 0, 'max': 2, 'nobs': 5, 'p1': 'a', 'p2': 2},
>>>     {'mean': 5, 'std': 4, 'min': 0, 'max': 3, 'nobs': 7, 'p1': 'a', 'p2': 1},
>>>     #
>>>     {'mean': 3, 'std': 1, 'min': 0, 'max': 20, 'nobs': 6, 'p1': 'b', 'p2': 1},
>>>     {'mean': 0, 'std': 2, 'min': 0, 'max': 20, 'nobs': 26, 'p1': 'b', 'p2': 2},
>>>     {'mean': 9, 'std': 3, 'min': 0, 'max': 20, 'nobs': 496, 'p1': 'b', 'p2': 1},
>>>     #
>>>     {'mean': 5, 'std': 0, 'min': 0, 'max': 1, 'nobs': 2, 'p1': 'c', 'p2': 2},
>>>     {'mean': 5, 'std': 0, 'min': 0, 'max': 1, 'nobs': 7, 'p1': 'c', 'p2': 2},
>>>     #
>>>     {'mean': 5, 'std': 2, 'min': 0, 'max': 2, 'nobs': 7, 'p1': 'd', 'p2': 2},
>>>     #
>>>     {'mean': 5, 'std': 2, 'min': 0, 'max': 2, 'nobs': 7, 'p1': 'e', 'p2': 1},
>>> ])
>>> print(data)
>>> new_data = aggregate_stats(data)
>>> print(new_data)
>>> new_data1 = aggregate_stats(data, group_keys=['p1'])
>>> print(new_data1)
>>> new_data2 = aggregate_stats(data, group_keys=['p2'])
>>> print(new_data2)
geowatch.utils.result_analysis.stats_dict(data, suffix='')[source]
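stats_dict is undocumented here; judging from how aggregate_stats and combine_stats consume its output, it plausibly returns the keys 'nobs', 'mean', 'std', 'min', and 'max', optionally suffixed. A hypothetical sketch (the real helper may differ, e.g. in its std ddof convention):

```python
import numpy as np


def stats_dict_sketch(data, suffix=''):
    # Summarize a 1-D array into the stat keys used by aggregate_stats.
    data = np.asarray(data, dtype=float)
    stats = {
        'nobs': data.size,
        'mean': data.mean(),
        'std': data.std(),  # population std (ddof=0); an assumption
        'min': data.min(),
        'max': data.max(),
    }
    return {key + suffix: value for key, value in stats.items()}
```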
geowatch.utils.result_analysis.combine_stats(s1, s2)[source]

Helper for combining mean and standard deviation of multiple measurements

Parameters:
  • s1 (dict) – stats dict containing mean, std, and n

  • s2 (dict) – stats dict containing mean, std, and n
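The combination follows the standard parallel pooling of means and sums of squared deviations (the Chan et al. update). A standalone sketch, assuming population-style std (ddof=0) and the keys 'nobs', 'mean', 'std', 'min', 'max':

```python
import math


def combine_stats_sketch(s1, s2):
    # Pool two summarized samples without access to the raw data.
    n1, n2 = s1['nobs'], s2['nobs']
    n = n1 + n2
    mean = (n1 * s1['mean'] + n2 * s2['mean']) / n
    # Combine sums of squared deviations; the cross term accounts for the
    # shift between the two group means.
    delta = s2['mean'] - s1['mean']
    ss = n1 * s1['std'] ** 2 + n2 * s2['std'] ** 2 + delta ** 2 * n1 * n2 / n
    return {
        'nobs': n,
        'mean': mean,
        'std': math.sqrt(ss / n),
        'min': min(s1['min'], s2['min']),
        'max': max(s1['max'], s2['max']),
    }
```

Pooling the summaries of [1, 2, 3] and [4, 8] this way reproduces the mean and std of the concatenated data exactly.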

Example

>>> from geowatch.utils.result_analysis import *  # NOQA
>>> basis = {
>>>     'nobs1': [1, 10, 100, 10000],
>>>     'nobs2': [1, 10, 100, 10000],
>>> }
>>> for params in ub.named_product(basis):
>>>     data1 = np.random.rand(params['nobs1'])
>>>     data2 = np.random.rand(params['nobs2'])
>>>     data3 = np.hstack([data1, data2])
>>>     s1 = stats_dict(data1)
>>>     s2 = stats_dict(data2)
>>>     s3 = stats_dict(data3)
>>>     # Check that our combo works
>>>     combo_s3 = combine_stats(s1, s2)
>>>     compare = pd.DataFrame({'raw': s3, 'combo': combo_s3})
>>>     print(compare)
>>>     assert np.allclose(compare.raw, compare.combo)

geowatch.utils.result_analysis.combine_stats_arrs(data)[source]