etl

Data wrangling utilities: smoothing, cross-correlation, JSON with complex numbers, time conversions, and more.

matviz.etl.array_pop(X, idx)
matviz.etl.average_times(time_1, time_2)

Compute the midpoint between two datetime objects.

Parameters:
time_1 : datetime

First time.

time_2 : datetime

Second time.

Returns:
datetime

Midpoint.
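
The midpoint calculation can be illustrated with the standard library alone. This is a minimal sketch of the idea, not matviz's implementation; the name `midpoint` is hypothetical.

```python
from datetime import datetime


def midpoint(time_1: datetime, time_2: datetime) -> datetime:
    """Return the instant halfway between two datetimes (hypothetical helper)."""
    # Subtracting datetimes gives a timedelta, which can be halved and added back.
    return time_1 + (time_2 - time_1) / 2


start = datetime(2024, 1, 1, 8, 0)
end = datetime(2024, 1, 1, 10, 0)
print(midpoint(start, end))  # 2024-01-01 09:00:00
```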

matviz.etl.chop(seq, size)

Chop a sequence into chunks of the given size.

matviz.etl.chopn(seq, n)

Chop a sequence into n chunks.

matviz.etl.chunk_data(data, window_size, overlap_size=0, flatten_inside_window=True)
matviz.etl.chunks(l, n)

Yield successive n-sized chunks from a sequence.

Parameters:
l : sequence

Input sequence to chunk.

n : int

Chunk size.

Yields:
sequence

Successive chunks of length n (last chunk may be shorter).
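
The chunking behavior described above can be sketched as a small generator. The name `chunks_sketch` is hypothetical, and the sketch assumes the input supports `len()` and slicing.

```python
def chunks_sketch(seq, n):
    """Yield successive n-sized slices of seq; the final slice may be shorter."""
    for start in range(0, len(seq), n):
        yield seq[start:start + n]


print(list(chunks_sketch([1, 2, 3, 4, 5, 6, 7], 3)))  # [[1, 2, 3], [4, 5, 6], [7]]
```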

matviz.etl.clean_whitespace(my_str)
matviz.etl.complex_dot(a, b)
matviz.etl.complex_dump(x)
matviz.etl.complex_load(txt)
matviz.etl.complex_noise(n, func=np.random.randn)

Generate random complex numbers.

Parameters:
n : int

Number of complex values to generate.

func : callable, optional

Random number generator. Default is np.random.randn.

Returns:
ndarray

Array of n complex random values.

matviz.etl.computeMD5hash(my_string)
matviz.etl.csvify_dict(weird_dict)

Convert a column-oriented dictionary into a CSV-like list of rows.

Parameters:
weird_dict : dict

Keys are column names, values are lists of column values.

Returns:
list of lists

First row is headers, followed by data rows.
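
The column-to-row conversion amounts to a transpose with a header row on top. The sketch below shows that idea under the assumption that all columns have equal length; `csvify_sketch` is a hypothetical name, not the library function.

```python
def csvify_sketch(columns):
    """Turn {"name": [values, ...]} into [header_row, data_row, ...]."""
    headers = list(columns)
    # zip(*...) transposes the column lists into rows; assumes equal-length columns.
    rows = [list(row) for row in zip(*(columns[h] for h in headers))]
    return [headers] + rows


table = csvify_sketch({"id": [1, 2], "name": ["a", "b"]})
print(table)  # [['id', 'name'], [1, 'a'], [2, 'b']]
```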

matviz.etl.dictify_cols(df)
matviz.etl.dictify_cols2(df)
matviz.etl.dictify_csv(weird_array, headers=None)

Convert a CSV-like list of rows into a column-oriented dictionary.

Parameters:
weird_array : list of lists

Row-oriented data. First row is used as keys unless headers is provided.

headers : list of str, optional

Column names. If given, prepended to weird_array.

Returns:
dict

Keys are column names, values are lists of column values.
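
The reverse direction, including the optional `headers` argument, might look like the following. This is an illustrative sketch of the documented behavior; `dictify_sketch` is a hypothetical name.

```python
def dictify_sketch(rows, headers=None):
    """Turn [header_row, data_row, ...] into {"name": [values, ...]}."""
    if headers is not None:
        # As described above, explicit headers are prepended to the rows.
        rows = [list(headers)] + list(rows)
    keys, data = rows[0], rows[1:]
    return {key: [row[i] for row in data] for i, key in enumerate(keys)}


with_header_row = dictify_sketch([["id", "name"], [1, "a"], [2, "b"]])
with_explicit_headers = dictify_sketch([[1, "a"], [2, "b"]], headers=["id", "name"])
print(with_header_row)  # {'id': [1, 2], 'name': ['a', 'b']}
```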

matviz.etl.drop_mostly_na(df, threshold=0.1, axis=1)

Drop columns or rows that are mostly NA.

Parameters:
df : DataFrame

Input DataFrame.

threshold : float, optional

Drop if NA fraction exceeds this value. Default is 0.1.

axis : {0, 1, 'rows', 'columns'}, optional

0 or 'rows' to drop rows, 1 or 'columns' to drop columns. Default is 1.

Returns:
DataFrame

DataFrame with sparse columns/rows removed.

matviz.etl.dump_json(data_dict, file_path, to_indent=None)

Save a dict to JSON, encoding NumPy/complex values for round-tripping.

Parameters:
data_dict : dict

Data to save.

file_path : str

Output file path.

to_indent : int, optional

JSON indentation level. Default is None (compact).

Returns:
bool

True on success.

matviz.etl.encode_floats(nums, decimals=3)

Round numeric arrays to fixed decimal places for JSON export.

Parameters:
nums : array-like

Numbers to round.

decimals : int, optional

Number of decimal places. Default is 3.

Returns:
list

Rounded Decimal values as a list.
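
One way to produce fixed-precision `Decimal` values is `Decimal.quantize`. The sketch below assumes that approach and a half-even rounding mode; the source does not say how matviz rounds, and `encode_floats_sketch` is a hypothetical name.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def encode_floats_sketch(nums, decimals=3):
    """Round each number to a fixed number of decimal places as a Decimal."""
    quantum = Decimal(10) ** -decimals  # e.g. Decimal('0.001') for decimals=3
    # Going through str() avoids carrying binary floating-point noise into the Decimal.
    return [Decimal(str(x)).quantize(quantum, rounding=ROUND_HALF_EVEN) for x in nums]


rounded = encode_floats_sketch([3.14159, 2.0])
print([str(value) for value in rounded])  # ['3.142', '2.000']
```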

matviz.etl.find_dom_freq(x, ds, window='hann')
matviz.etl.find_percentile(value, percentiles)

Find which percentile bin a value falls into.

Parameters:
value : float

The value to look up.

percentiles : array-like

Sorted percentile boundaries.

Returns:
int

Index of the closest percentile.

matviz.etl.first_non_zero_or_nan(x)

Return the index of the first non-zero element, or NaN if none.

Parameters:
x : array-like

Array to search.

Returns:
int or float

Index of the first non-zero element, or np.nan.
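
The int-or-NaN return convention can be sketched in pure Python as follows. `first_non_zero_sketch` is a hypothetical name, and how the library treats NaN elements in the input is not stated in the source.

```python
import math


def first_non_zero_sketch(x):
    """Return the index of the first non-zero element, or NaN if there is none."""
    for i, value in enumerate(x):
        # Note: a NaN element compares unequal to 0, so this sketch counts it as non-zero.
        if value != 0:
            return i
    return math.nan


print(first_non_zero_sketch([0, 0, 3, 0]))  # 2
print(first_non_zero_sketch([0, 0]))        # nan
```

Returning NaN rather than raising keeps the function usable inside vectorized or mapped pipelines, at the cost of a mixed int/float return type.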

matviz.etl.flatten(values)

Flatten nested arrays/lists into a single 1D NumPy array.

Parameters:
values : array-like or nested list

Nested structure of arrays to flatten.

Returns:
ndarray

Concatenated 1D array.

matviz.etl.flatten_list(list_of_lists)
matviz.etl.form_day(key)
matviz.etl.form_year(key)
matviz.etl.full_group_by(l, key=<function <lambda>>)
matviz.etl.geometric_median(X, eps=1e-05)

Compute the geometric median of a set of points.

The geometric median minimizes the sum of Euclidean distances to all points – like a median in 2D or higher dimensions.

Parameters:
X : ndarray of shape (n, d)

Point cloud.

eps : float, optional

Convergence threshold. Default is 1e-5.

Returns:
ndarray of shape (d,)

The geometric median.
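
A common way to compute a geometric median is Weiszfeld's iteration, which repeatedly re-weights each point by the inverse of its distance to the current estimate. The source does not state which algorithm matviz uses, so the pure-Python sketch below is only an illustration of that technique; `geometric_median_sketch` and `max_iter` are hypothetical, and skipping coincident points is a simplification.

```python
import math


def geometric_median_sketch(points, eps=1e-5, max_iter=1000):
    """Approximate the geometric median of d-dimensional points (Weiszfeld iteration)."""
    dim = len(points[0])
    # Start from the centroid (the arithmetic mean of the points).
    guess = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):
        weights = []
        for p in points:
            dist = math.dist(p, guess)
            # Skip points that coincide with the guess to avoid dividing by zero.
            weights.append(0.0 if dist == 0 else 1.0 / dist)
        total = sum(weights)
        if total == 0:  # every point coincides with the guess
            return guess
        new_guess = [
            sum(w * p[k] for w, p in zip(weights, points)) / total for k in range(dim)
        ]
        # Stop once the estimate moves less than eps between iterations.
        if math.dist(new_guess, guess) < eps:
            return new_guess
        guess = new_guess
    return guess


square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(geometric_median_sketch(square))  # [0.5, 0.5]
```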

matviz.etl.get_object_size(obj)

Get the size of a Python object in megabytes.

Parameters:
obj : object

Any Python object.

Returns:
str

Human-readable size string.

matviz.etl.get_random_state(seed=12345)
matviz.etl.handle_dates(X)
matviz.etl.hex2rgb(color_input)

Convert a hex color (string or integer) to normalized RGB.

Parameters:
color_input : str or int

Hex string (e.g. '#FF8800') or hex integer (e.g. 0xFF8800).

Returns:
list of float

[r, g, b] values normalized to 0-1.
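
Handling both input forms comes down to reducing the input to an integer and then extracting three 8-bit channels. A minimal sketch, assuming six-digit hex colors; `hex2rgb_sketch` is a hypothetical name.

```python
def hex2rgb_sketch(color_input):
    """Convert '#RRGGBB' or 0xRRGGBB to [r, g, b] floats in the range 0-1."""
    if isinstance(color_input, str):
        value = int(color_input.lstrip("#"), 16)
    else:
        value = int(color_input)
    # Pull out each 8-bit channel with a shift and mask, then scale to 0-1.
    return [((value >> shift) & 0xFF) / 255 for shift in (16, 8, 0)]


from_string = hex2rgb_sketch("#FF8800")
from_int = hex2rgb_sketch(0xFF8800)
print(from_string == from_int)  # True
```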

matviz.etl.interp_nans(t, y, t_i=None)

Interpolate over NaN gaps using PCHIP and optionally resample.

Parameters:
t : array-like

Time axis.

y : array-like

Values (NaN positions are interpolated over).

t_i : array-like, optional

New time axis for resampling. Default auto-generates from median spacing.

Returns:
t_i : ndarray

Interpolated time axis.

y_i : ndarray

Interpolated values.

matviz.etl.isdigit(s)

Check if a value is numeric, including decimal strings.

Parameters:
s : any

Value to test.

Returns:
bool

True if s is a number or a numeric string.

matviz.etl.list_depth(seq)
matviz.etl.load_json(file_path)

Load a JSON file, restoring any embedded NumPy/complex values.

Parameters:
file_path : str

Path to the JSON file.

Returns:
dict

Parsed data with complex numbers restored.

matviz.etl.loads_json(json_str)

Parse a JSON string, restoring any embedded NumPy/complex values.

Parameters:
json_str : str

JSON string.

Returns:
dict

Parsed data with complex numbers restored.
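
JSON has no native complex type, so round-tripping requires tagging complex values on the way out and rebuilding them on the way in. The sketch below shows one such scheme using only the standard `json` module. The `"__complex__"` tag and both helper names are assumptions made for illustration; the actual on-disk encoding used by matviz is not documented here.

```python
import json


def encode_complex(obj):
    # json.dumps calls this for objects it cannot serialize natively.
    if isinstance(obj, complex):
        return {"__complex__": True, "real": obj.real, "imag": obj.imag}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def decode_complex(d):
    # json.loads calls this for every decoded JSON object; rebuild tagged values.
    if d.get("__complex__"):
        return complex(d["real"], d["imag"])
    return d


text = json.dumps({"z": 1 + 2j, "n": 3}, default=encode_complex)
restored = json.loads(text, object_hook=decode_complex)
print(restored)  # {'z': (1+2j), 'n': 3}
```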

matviz.etl.map_nested_dicts(ob, func)
matviz.etl.max_lag(x1, x2, ds, max_lag_allowed=inf)

Find the lag with the highest cross-correlation.

Parameters:
x1 : array-like

First signal.

x2 : array-like

Second signal.

ds : float

Time step between samples.

max_lag_allowed : float, optional

Maximum allowable lag. Default is infinity.

Returns:
max_lag_out : float

Lag at the peak correlation.

max_corr : float

Value of the peak correlation.

matviz.etl.merge_two_dicts(x, y)
matviz.etl.microsoft_to_timestamp(ts)

Convert a Microsoft timestamp (100-ns ticks since 1601) to a pandas Timestamp.

Parameters:
ts : int

Microsoft timestamp value.

Returns:
pandas.Timestamp

Equivalent pandas Timestamp.
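
The tick arithmetic can be checked with the standard library: ten 100-ns ticks make one microsecond, counted from 1601-01-01. This sketch returns a naive `datetime` rather than a pandas Timestamp and drops any sub-microsecond remainder; `microsoft_ticks_to_datetime` is a hypothetical name.

```python
from datetime import datetime, timedelta

EPOCH_1601 = datetime(1601, 1, 1)


def microsoft_ticks_to_datetime(ts: int) -> datetime:
    """Convert 100-ns ticks since 1601-01-01 to a naive datetime."""
    # Each tick is 100 ns, so 10 ticks make one microsecond.
    return EPOCH_1601 + timedelta(microseconds=ts // 10)


# 11,644,473,600 seconds separate 1601-01-01 from the Unix epoch.
print(microsoft_ticks_to_datetime(116444736000000000))  # 1970-01-01 00:00:00
```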

matviz.etl.most_common(cur_list)
matviz.etl.nan_smooth(y, n=5, ens=[], ignore_nans=True)

Smooth a time series using convolution, handling NaN values gracefully.

Parameters:
y : array-like

Time series to smooth. Supports complex values.

n : int or array-like, optional

If int, uses a Hanning window of length n + 2. If array-like, uses it directly as the convolution window. Default is 5.

ens : array-like, optional

Per-point weights (same length as y). Default is ones with zeros at NaN positions.

ignore_nans : bool, optional

If True (default), treat NaN positions as missing data.

Returns:
ndarray

Smoothed values, centered, same length as input.
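
NaN-tolerant smoothing is commonly done as a normalized convolution: convolve the data with NaNs zeroed out, then divide by the convolution of the weights, so missing samples do not drag the average down. The pure-Python sketch below illustrates that idea for real-valued input with an integer, odd `n`. It is an assumption about the general technique, not a copy of matviz's code, and `nan_smooth_sketch` is a hypothetical name.

```python
import math


def nan_smooth_sketch(y, n=5):
    """Hann-weighted moving average that skips NaN samples."""
    size = n + 2
    # Hann window of length n + 2; its two end points are zero, leaving n non-zero weights.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / (size - 1)) for k in range(size)]
    half = size // 2
    smoothed = []
    for i in range(len(y)):
        num = den = 0.0
        for k, w in enumerate(window):
            j = i + k - half  # index of the sample under window position k
            if 0 <= j < len(y) and not math.isnan(y[j]):
                num += w * y[j]
                den += w
        # Dividing by the summed weights renormalizes near edges and NaN gaps.
        smoothed.append(num / den if den > 0 else math.nan)
    return smoothed


noisy = [1.0, 2.0, float("nan"), 2.0, 3.0, 2.0, 1.0]
print(nan_smooth_sketch(noisy))
```

Because the weights are renormalized at every point, the gap at index 2 is filled from its neighbors and the output has the same length as the input.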

matviz.etl.parse_min_sec(time_str)

Convert a minutes-and-seconds time string to seconds. Note that this does not work if the string includes hours.
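
As an illustration, assuming the input looks like `"M:SS"` (the exact accepted formats are not documented here), the conversion could be sketched as follows. `parse_min_sec_sketch` is a hypothetical name.

```python
def parse_min_sec_sketch(time_str):
    """Convert an 'M:SS' or 'M:SS.s' string to seconds (no hours component)."""
    # Unpacking into exactly two parts raises ValueError for strings with hours.
    minutes, seconds = time_str.split(":")
    return int(minutes) * 60 + float(seconds)


print(parse_min_sec_sketch("3:25"))  # 205.0
```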

matviz.etl.pprint_entire_df(df)
matviz.etl.read_csv(name, qt=1)
matviz.etl.read_string(name)
matviz.etl.recurse_func(my_list, my_func, stop_level=False)

Recursively apply a function at a given nesting depth.

Parameters:
my_list : list

A list or nested list.

my_func : callable

Function to apply.

stop_level : int or False, optional

Nesting level at which to apply my_func. False applies at the deepest level. Default is False.

Returns:
list

Same structure as my_list with my_func applied.

matviz.etl.remove_tz(cur_datetime)
matviz.etl.reverse_dict(tmp_dict)
matviz.etl.rgb2hex(r, g, b)
matviz.etl.robust_floater(w)

Convert a value to a numeric type where possible.

Parameters:
w : any

Value to convert.

Returns:
float or numeric
  • null/NaN values -> np.nan

  • Timestamps/datetimes -> Unix timestamp (float)

  • Numeric strings -> float

  • Non-numeric strings -> np.nan

  • Numbers -> unchanged

  • Everything else -> np.nan

matviz.etl.robust_mkdir(desired_dir)

Create a directory and all parents, ignoring if it already exists.

Parameters:
desired_dir : str or Path

Directory path to create.

matviz.etl.robust_rmdir(cur_dir)
matviz.etl.rolling_diff(w, n=1)
matviz.etl.round_time(ts, round_by='H')

Round timestamps to a given frequency.

Parameters:
ts : datetime or array-like

Timestamp(s) to round.

round_by : str, optional

Pandas frequency string (e.g. 'H', 'T', 'D'). Default is 'H'.

Returns:
Series

Rounded timestamps.

matviz.etl.sort_dict_alphabetically(cur_dict)
matviz.etl.sort_dict_list(dict_list, k, reverse_param=True)
matviz.etl.split_list(cur_list, func)
matviz.etl.sql(query, db)

Execute a SQL query and return results.

Parameters:
query : str

SQL query string.

db : cursor

Database cursor from a connection.

Returns:
list

Query results.

matviz.etl.sql_redshift(query, db, array)
matviz.etl.start_and_ends(logical_array)

Find contiguous True regions in a boolean array.

Parameters:
logical_array : array-like of bool

Boolean array to scan.

Returns:
list of (int, int)

Start and end index pairs for each contiguous True region.
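
Run detection of this kind can be sketched with a single pass over the array. The source does not say whether the returned end index is inclusive or exclusive; the sketch below assumes an exclusive end (so `flags[start:end]` is the run), and `start_and_ends_sketch` is a hypothetical name.

```python
def start_and_ends_sketch(flags):
    """Return (start, end) pairs for each run of truthy values; end is exclusive."""
    regions = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i  # a run begins here
        elif not flag and start is not None:
            regions.append((start, i))  # the run ended at index i - 1
            start = None
    if start is not None:
        # The final run extends to the end of the array.
        regions.append((start, len(flags)))
    return regions


print(start_and_ends_sketch([False, True, True, False, True]))  # [(1, 3), (4, 5)]
```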

matviz.etl.subsetter(results, vars)
matviz.etl.time_delta_to_days(w)
matviz.etl.time_delta_to_seconds(w)
matviz.etl.timestamp_to_fraction(dates)

Convert pandas Timestamps to fraction of day (0.0 to 1.0).

Parameters:
dates : Series of Timestamp

Timestamps to convert.

Returns:
Series of float

Fraction of day elapsed.
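
For a single timestamp, the fraction of the day is the elapsed seconds since midnight divided by 86,400. The sketch below works on one standard-library `datetime` rather than a pandas Series; `fraction_of_day` is a hypothetical name.

```python
from datetime import datetime


def fraction_of_day(ts: datetime) -> float:
    """Return how much of the day has elapsed at ts, as a value from 0.0 up to 1.0."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second + ts.microsecond / 1e6
    return seconds / 86400  # 86,400 seconds in a day


print(fraction_of_day(datetime(2024, 3, 1, 6, 0)))   # 0.25
print(fraction_of_day(datetime(2024, 3, 1, 18, 0)))  # 0.75
```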

matviz.etl.to_tz(cur_tz, local_tz='US/Eastern')
matviz.etl.tz_to_utc(cur_datetime, local_tz='US/Eastern', native=True)
matviz.etl.unflatten(flat_values, prototype)
matviz.etl.utc_to_tz(cur_utc, local_tz='US/Eastern')
matviz.etl.write_csv(name, array, param='w')
matviz.etl.write_csv_safe(name, array, param='w')

Write an array to a CSV file, refusing to overwrite existing files.

Parameters:
name : str

Output file path.

array : list of lists

Rows to write.

param : str, optional

File mode. Default is 'w'.

Returns:
bool

True on success.

Raises:
FileExistsError

If the file already exists.
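
The refuse-to-overwrite behavior can be reproduced with Python's exclusive-creation file mode. This sketch opens the file with mode `"x"` instead of taking a `param` argument, which is an assumption about one way to get the documented behavior; `write_csv_safe_sketch` is a hypothetical name.

```python
import csv
import os
import tempfile


def write_csv_safe_sketch(name, rows):
    """Write rows to a new CSV file; raise FileExistsError if it already exists."""
    # Mode "x" creates the file and fails if the path is already taken.
    with open(name, "x", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return True


path = os.path.join(tempfile.mkdtemp(), "out.csv")
print(write_csv_safe_sketch(path, [["id", "name"], [1, "a"]]))  # True
try:
    write_csv_safe_sketch(path, [["overwrite"]])
except FileExistsError:
    print("refused to overwrite", os.path.basename(path))
```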

matviz.etl.write_string(name, txt)
matviz.etl.xcorr(a, b, dt)

Normalized cross-correlation (MATLAB-style).

Parameters:
a : array-like

First signal.

b : array-like

Second signal.

dt : float

Time step between samples.

Time step between samples.

Returns:
corrs : ndarray

Normalized correlation values.

lags : ndarray

Lag values in the same units as dt.
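
To make the normalization and lag axis concrete, here is a pure-Python sketch of a cross-correlation scaled so that identical signals give 1.0 at zero lag (the convention MATLAB calls `'coeff'`). It assumes real-valued, equal-length inputs and the sign convention that a positive lag means `a` is delayed relative to `b`; the library's own convention is not stated in the source, and `xcorr_sketch` is a hypothetical name.

```python
import math


def xcorr_sketch(a, b, dt):
    """Return (corrs, lags) for lags from -(n - 1) * dt to (n - 1) * dt."""
    n = len(a)  # assumes len(a) == len(b)
    # Scale by the signal energies so a perfect match gives a correlation of 1.0.
    norm = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    corrs, lags = [], []
    for lag in range(-(n - 1), n):
        # Sum a[i + lag] * b[i] over the samples where both indices are valid.
        total = sum(a[i + lag] * b[i] for i in range(n) if 0 <= i + lag < n)
        corrs.append(total / norm)
        lags.append(lag * dt)
    return corrs, lags


b = [0.0, 1.0, 0.0, 0.0, 0.0]
a = [0.0, 0.0, 0.0, 1.0, 0.0]  # b delayed by two samples
corrs, lags = xcorr_sketch(a, b, dt=0.5)
best = corrs.index(max(corrs))
print(lags[best], corrs[best])  # 1.0 1.0
```

Picking the lag at the maximum of `corrs`, as in the last lines, is the same idea that `max_lag` describes: report the lag and value of the peak correlation.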