etl

Data wrangling utilities: smoothing, cross-correlation, JSON with complex numbers, time conversions, and more.

matviz.etl.array_pop(X, idx)

matviz.etl.average_times(time_1, time_2)

Compute the midpoint between two datetime objects.

Parameters:

time_1 (datetime): First time.
time_2 (datetime): Second time.

Returns:

datetime: Midpoint.
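
The midpoint calculation described above can be sketched with the standard library alone. This is an independent illustration, not the matviz source; the name average_times_sketch is hypothetical.

```python
from datetime import datetime


def average_times_sketch(time_1, time_2):
    """Return the datetime halfway between time_1 and time_2."""
    # Adding half of the timedelta works for either argument order.
    return time_1 + (time_2 - time_1) / 2


start = datetime(2024, 1, 1, 8, 0, 0)
end = datetime(2024, 1, 1, 10, 0, 0)
midpoint = average_times_sketch(start, end)
print(midpoint)
```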

matviz.etl.chop(seq, size)

Chop a sequence into chunks of the given size.

matviz.etl.chopn(seq, n)

Chop a sequence into chunks of the given size.

matviz.etl.chunk_data(data, window_size, overlap_size=0, flatten_inside_window=True)

matviz.etl.chunks(l, n)

Yield successive n-sized chunks from a sequence.

Parameters:

l (sequence): Input sequence to chunk.
n (int): Chunk size.

Yields:

sequence: Successive chunks of length n (last chunk may be shorter).
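
A generator with the behavior described above might look like the following. It is a minimal sketch rather than the library's code, and chunks_sketch is a hypothetical name.

```python
def chunks_sketch(seq, n):
    """Yield successive slices of length n; the last may be shorter."""
    for start in range(0, len(seq), n):
        yield seq[start:start + n]


result = list(chunks_sketch([1, 2, 3, 4, 5, 6, 7], 3))
print(result)
```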

matviz.etl.clean_whitespace(my_str)

matviz.etl.complex_dot(a, b)

matviz.etl.complex_dump(x)

matviz.etl.complex_load(txt)

matviz.etl.complex_noise(n, func=np.random.randn)

Generate random complex numbers.

Parameters:

n (int): Number of complex values to generate.
func (callable, optional): Random number generator. Default is np.random.randn.

Returns:

ndarray: Array of n complex random values.

matviz.etl.computeMD5hash(my_string)

matviz.etl.csvify_dict(weird_dict)

Convert a column-oriented dictionary into a CSV-like list of rows.

Parameters:

weird_dict (dict): Keys are column names, values are lists of column values.

Returns:

list of lists: First row is headers, followed by data rows.
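
The column-to-row conversion can be illustrated in plain Python. This sketch assumes all columns have the same length, which the documentation above does not state; csvify_dict_sketch is a hypothetical name, not the matviz implementation.

```python
def csvify_dict_sketch(col_dict):
    """Turn {column: [values]} into [[headers], [row1], [row2], ...]."""
    headers = list(col_dict.keys())
    # zip(*columns) transposes the columns into rows (assumes equal lengths).
    rows = [list(row) for row in zip(*col_dict.values())]
    return [headers] + rows


table = csvify_dict_sketch({"name": ["a", "b"], "score": [1, 2]})
print(table)
```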

matviz.etl.dictify_cols(df)

matviz.etl.dictify_cols2(df)

matviz.etl.dictify_csv(weird_array, headers=None)

Convert a CSV-like list of rows into a column-oriented dictionary.

Parameters:

weird_array (list of lists): Row-oriented data. First row is used as keys unless headers is provided.
headers (list of str, optional): Column names. If given, prepended to weird_array.

Returns:

dict: Keys are column names, values are lists of column values.
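
The row-to-column conversion, including the optional headers argument, could be sketched as follows. The function name dictify_csv_sketch is hypothetical, and the code assumes every row has one entry per header.

```python
def dictify_csv_sketch(rows, headers=None):
    """Turn [[headers], [row1], ...] into {column: [values]}."""
    if headers is not None:
        # Mirror the documented behavior: headers are prepended to the rows.
        rows = [list(headers)] + list(rows)
    keys, data = rows[0], rows[1:]
    return {key: [row[i] for row in data] for i, key in enumerate(keys)}


with_header_row = dictify_csv_sketch([["name", "score"], ["a", 1], ["b", 2]])
with_headers_arg = dictify_csv_sketch([["a", 1], ["b", 2]], headers=["name", "score"])
print(with_header_row)
```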

matviz.etl.drop_mostly_na(df, threshold=0.1, axis=1)

Drop columns or rows that are mostly NA.

Parameters:

df (DataFrame): Input DataFrame.
threshold (float, optional): Drop if NA fraction exceeds this value. Default is 0.1.
axis ({0, 1, 'rows', 'columns'}, optional): 0 or 'rows' to drop rows, 1 or 'columns' to drop columns. Default is 1.

Returns:

DataFrame: DataFrame with sparse columns/rows removed.

matviz.etl.dump_json(data_dict, file_path, to_indent=None)

Save a dict to JSON, encoding NumPy/complex values for round-tripping.

Parameters:

data_dict (dict): Data to save.
file_path (str): Output file path.
to_indent (int, optional): JSON indentation level. Default is None (compact).

Returns:

bool: True on success.
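
To show what round-tripping complex values through JSON involves, here is a minimal sketch built on the standard json module. The "__complex__" tag and the helper names are assumptions made for illustration; the on-disk format that matviz actually uses is not documented here, and this sketch works on strings rather than files.

```python
import json


def encode_complex(obj):
    """json.dumps 'default' hook: tag complex numbers so they can be restored."""
    if isinstance(obj, complex):
        # Hypothetical tagging scheme, chosen only for this example.
        return {"__complex__": True, "real": obj.real, "imag": obj.imag}
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def decode_complex(d):
    """json.loads 'object_hook': rebuild tagged complex numbers."""
    if d.get("__complex__"):
        return complex(d["real"], d["imag"])
    return d


data = {"gain": 1 + 2j, "samples": [0.5, 3 - 1j]}
text = json.dumps(data, default=encode_complex)
restored = json.loads(text, object_hook=decode_complex)
print(restored)
```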

matviz.etl.encode_floats(nums, decimals=3)

Round numeric arrays to fixed decimal places for JSON export.

Parameters:

nums (array-like): Numbers to round.
decimals (int, optional): Number of decimal places. Default is 3.

Returns:

list: Rounded Decimal values as a list.
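
One way to produce fixed-precision Decimal values is Decimal.quantize from the standard library. The sketch below is an assumption about the general approach, not the library's code; encode_floats_sketch is a hypothetical name, and rounding follows the default decimal context (round-half-even).

```python
from decimal import Decimal


def encode_floats_sketch(nums, decimals=3):
    """Round each number to a fixed number of decimal places as a Decimal."""
    # Decimal(1).scaleb(-3) is Decimal('0.001'), the target precision.
    quantum = Decimal(1).scaleb(-decimals)
    # Going through str() avoids exposing binary floating-point noise.
    return [Decimal(str(x)).quantize(quantum) for x in nums]


rounded = encode_floats_sketch([0.1, 2.71828])
print(rounded)
```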

matviz.etl.find_dom_freq(x, ds, window='hann')

matviz.etl.find_percentile(value, percentiles)

Find which percentile bin a value falls into.

Parameters:

value (float): The value to look up.
percentiles (array-like): Sorted percentile boundaries.

Returns:

int: Index of the closest percentile.

matviz.etl.first_non_zero_or_nan(x)

Return the index of the first non-zero element, or NaN if none.

Parameters:

x (array-like): Array to search.

Returns:

int or float: Index of the first non-zero element, or np.nan.
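
The mixed return type (an integer index, or NaN when nothing is found) is easiest to see in a small example. This is a plain-Python sketch with a hypothetical name; how the library treats NaN elements inside x is not documented, so the behavior noted in the comment is an assumption of this sketch only.

```python
import math


def first_non_zero_sketch(x):
    """Return the index of the first non-zero element, or NaN if there is none."""
    for i, value in enumerate(x):
        # Note: a NaN element compares unequal to 0, so it counts as non-zero here.
        if value != 0:
            return i
    return math.nan


found = first_non_zero_sketch([0, 0, 3, 0])
missing = first_non_zero_sketch([0, 0, 0])
print(found, missing)
```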

matviz.etl.flatten(values)

Flatten nested arrays/lists into a single 1D NumPy array.

Parameters:

values (array-like or nested list): Nested structure of arrays to flatten.

Returns:

ndarray: Concatenated 1D array.

matviz.etl.flatten_list(list_of_lists)

matviz.etl.form_day(key)

matviz.etl.form_year(key)

matviz.etl.full_group_by(l, key=<function <lambda>>)

matviz.etl.geometric_median(X, eps=1e-05)

Compute the geometric median of a set of points.

The geometric median minimizes the sum of Euclidean distances to all points – like a median in 2D or higher dimensions.

Parameters:

X (ndarray of shape (n, d)): Point cloud.
eps (float, optional): Convergence threshold. Default is 1e-5.

Returns:

ndarray of shape (d,): The geometric median.
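
The documentation above does not say which algorithm is used. A common choice for this problem is Weiszfeld's iteration, sketched below in plain Python as an illustration only; the function name, the max_iter parameter, and the handling of coincident points are all assumptions of this sketch.

```python
import math


def geometric_median_sketch(points, eps=1e-5, max_iter=1000):
    """Approximate the geometric median with Weiszfeld's iteration."""
    dim = len(points[0])
    # Start from the centroid (the arithmetic mean of the points).
    current = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):
        weight_sum = 0.0
        weighted = [0.0] * dim
        for p in points:
            dist = math.dist(p, current)
            if dist == 0.0:
                continue  # simplification: skip a point equal to the estimate
            weight = 1.0 / dist
            weight_sum += weight
            for k in range(dim):
                weighted[k] += weight * p[k]
        if weight_sum == 0.0:
            return current  # every point coincides with the estimate
        updated = [value / weight_sum for value in weighted]
        if math.dist(updated, current) < eps:
            return updated  # step smaller than the convergence threshold
        current = updated
    return current


center = geometric_median_sketch([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
print(center)
```

Unlike the centroid, the result is only weakly affected by a single distant outlier, which is the usual reason to prefer it.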

matviz.etl.get_object_size(obj)

Get the size of a Python object in megabytes.

Parameters:

obj (object): Any Python object.

Returns:

str: Human-readable size string.

matviz.etl.get_random_state(seed=12345)

matviz.etl.handle_dates(X)

matviz.etl.hex2rgb(color_input)

Convert a hex color (string or integer) to normalized RGB.

Parameters:

color_input (str or int): Hex string (e.g. '#FF8800') or hex integer (e.g. 0xFF8800).

Returns:

list of float: [r, g, b] values normalized to 0-1.
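
Both input forms reduce to the same 24-bit integer, as this sketch shows. It is an independent illustration with a hypothetical name and assumes six-digit hex colors.

```python
def hex2rgb_sketch(color_input):
    """Convert '#RRGGBB' or 0xRRGGBB to [r, g, b] floats in the range 0-1."""
    if isinstance(color_input, str):
        value = int(color_input.lstrip("#"), 16)
    else:
        value = int(color_input)
    red = (value >> 16) & 0xFF
    green = (value >> 8) & 0xFF
    blue = value & 0xFF
    return [red / 255, green / 255, blue / 255]


from_string = hex2rgb_sketch("#FF8800")
from_int = hex2rgb_sketch(0xFF8800)
print(from_string)
```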

matviz.etl.interp_nans(t, y, t_i=None)

Interpolate over NaN gaps using PCHIP and optionally resample.

Parameters:

t (array-like): Time axis.
y (array-like): Values (NaN positions are interpolated over).
t_i (array-like, optional): New time axis for resampling. Default auto-generates from median spacing.

Returns:

t_i (ndarray): Interpolated time axis.
y_i (ndarray): Interpolated values.

matviz.etl.isdigit(s)

Check if a value is numeric, including decimal strings.

Parameters:

s (any): Value to test.

Returns:

bool: True if s is a number or a numeric string.

matviz.etl.list_depth(seq)

matviz.etl.load_json(file_path)

Load a JSON file, restoring any embedded NumPy/complex values.

Parameters:

file_path (str): Path to the JSON file.

Returns:

dict: Parsed data with complex numbers restored.

matviz.etl.loads_json(json_str)

Parse a JSON string, restoring any embedded NumPy/complex values.

Parameters:

json_str (str): JSON string.

Returns:

dict: Parsed data with complex numbers restored.

matviz.etl.map_nested_dicts(ob, func)

matviz.etl.max_lag(x1, x2, ds, max_lag_allowed=inf)

Find the lag with the highest cross-correlation.

Parameters:

x1 (array-like): First signal.
x2 (array-like): Second signal.
ds (float): Time step between samples.
max_lag_allowed (float, optional): Maximum allowable lag. Default is infinity.

Returns:

max_lag_out (float): Lag at the peak correlation.
max_corr (float): Value of the peak correlation.

matviz.etl.merge_two_dicts(x, y)

matviz.etl.microsoft_to_timestamp(ts)

Convert a Microsoft timestamp (100-ns ticks since 1601) to pandas Timestamp.

Parameters:

ts (int): Microsoft timestamp value.

Returns:

pandas.Timestamp: Equivalent pandas Timestamp.
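
The tick arithmetic can be shown with the standard library. This sketch returns a datetime rather than a pandas Timestamp, assumes the ticks are counted in UTC (the documentation above does not say), and truncates to microsecond precision; the names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Start of the Microsoft tick count: 1601-01-01 (UTC assumed in this sketch).
WINDOWS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def microsoft_to_datetime_sketch(ticks):
    """Convert 100-ns ticks since 1601-01-01 to a timezone-aware datetime."""
    # 10 ticks of 100 ns make one microsecond; sub-microsecond ticks are dropped.
    return WINDOWS_EPOCH + timedelta(microseconds=ticks // 10)


# 11644473600 seconds separate 1601-01-01 from the Unix epoch, 1970-01-01.
unix_epoch_ticks = 11644473600 * 10_000_000
converted = microsoft_to_datetime_sketch(unix_epoch_ticks)
print(converted)
```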

matviz.etl.most_common(cur_list)

matviz.etl.nan_smooth(y, n=5, ens=[], ignore_nans=True)

Smooth a time series using convolution, handling NaN values gracefully.

Parameters:

y (array-like): Time series to smooth. Supports complex values.
n (int or array-like, optional): If int, uses a Hanning window of length n + 2. If array-like, uses it directly as the convolution window. Default is 5.
ens (array-like, optional): Per-point weights (same length as y). Default is ones with zeros at NaN positions.
ignore_nans (bool, optional): If True (default), treat NaN positions as missing data.

Returns:

ndarray: Smoothed values, centered, same length as input.
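
One plausible reading of NaN-aware smoothing is a weighted moving average in which NaN samples get zero weight and the result is renormalized by the weights that remain. The pure-Python sketch below follows that reading for real-valued input and an integer n only; it is an assumption about the approach, not the library's implementation, and the names are hypothetical.

```python
import math


def hann_window(m):
    """Hann window of length m (same formula as numpy.hanning)."""
    if m == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (m - 1)) for k in range(m)]


def nan_smooth_sketch(y, n=5):
    """Centered Hann smoothing that skips NaN samples and renormalizes."""
    window = hann_window(n + 2)  # the two end points of the window are zero
    half = len(window) // 2
    smoothed = []
    for i in range(len(y)):
        numerator = 0.0
        denominator = 0.0
        for k, weight in enumerate(window):
            j = i + k - half
            if 0 <= j < len(y) and not math.isnan(y[j]):
                numerator += weight * y[j]
                denominator += weight
        # With no valid neighbors there is nothing to average.
        smoothed.append(numerator / denominator if denominator > 0 else math.nan)
    return smoothed


series = [2.0, 2.0, math.nan, 2.0, 2.0, 2.0]
smoothed_series = nan_smooth_sketch(series)
print(smoothed_series)
```

Because the weights are renormalized at every point, a constant series stays constant across the gap and at the edges.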

matviz.etl.parse_min_sec(time_str)

Convert a minutes-and-seconds time string into seconds. Strings that include an hours component are not supported.

matviz.etl.pprint_entire_df(df)

matviz.etl.read_csv(name, qt=1)

matviz.etl.read_string(name)

matviz.etl.recurse_func(my_list, my_func, stop_level=False)

Recursively apply a function at a given nesting depth.

Parameters:

my_list (list): A list or nested list.
my_func (callable): Function to apply.
stop_level (int or False, optional): Nesting level at which to apply my_func. False applies at the deepest level. Default is False.

Returns:

list: Same structure as my_list with my_func applied.

matviz.etl.remove_tz(cur_datetime)

matviz.etl.reverse_dict(tmp_dict)

matviz.etl.rgb2hex(r, g, b)

matviz.etl.robust_floater(w)

Convert a value to a numeric type where possible.

Parameters:

w (any): Value to convert.

Returns:

float or numeric

null/NaN values -> np.nan
Timestamps/datetimes -> Unix timestamp (float)
Numeric strings -> float
Non-numeric strings -> np.nan
Numbers -> unchanged
Everything else -> np.nan

matviz.etl.robust_mkdir(desired_dir)

Create a directory and all parents, ignoring if it already exists.

Parameters:

desired_dir (str or Path): Directory path to create.

matviz.etl.robust_rmdir(cur_dir)

matviz.etl.rolling_diff(w, n=1)

matviz.etl.round_time(ts, round_by='H')

Round timestamps to a given frequency.

Parameters:

ts (datetime or array-like): Timestamp(s) to round.
round_by (str, optional): Pandas frequency string (e.g. 'H', 'T', 'D'). Default is 'H'.

Returns:

Series: Rounded timestamps.

matviz.etl.sort_dict_alphabetically(cur_dict)

matviz.etl.sort_dict_list(dict_list, k, reverse_param=True)

matviz.etl.split_list(cur_list, func)

matviz.etl.sql(query, db)

Execute a SQL query and return results.

Parameters:

query (str): SQL query string.
db (cursor): Database cursor from a connection.

Returns:

list: Query results.

matviz.etl.sql_redshift(query, db, array)

matviz.etl.start_and_ends(logical_array)

Find contiguous True regions in a boolean array.

Parameters:

logical_array (array-like of bool): Boolean array to scan.

Returns:

list of (int, int): Start and end index pairs for each contiguous True region.
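
Run detection of this kind can be sketched with a single pass over the flags. The documentation above does not say whether the end index is inclusive; this sketch assumes it is, and start_and_ends_sketch is a hypothetical name.

```python
def start_and_ends_sketch(flags):
    """Return (start, end) index pairs for each run of True values.

    The end index is inclusive here, which is an assumption of this sketch.
    """
    regions = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i  # a new run begins
        elif not flag and start is not None:
            regions.append((start, i - 1))  # the run ended on the previous index
            start = None
    if start is not None:
        regions.append((start, len(flags) - 1))  # run reaches the end of the array
    return regions


regions = start_and_ends_sketch([False, True, True, False, True])
print(regions)
```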

matviz.etl.subsetter(results, vars)

matviz.etl.time_delta_to_days(w)

matviz.etl.time_delta_to_seconds(w)

matviz.etl.timestamp_to_fraction(dates)

Convert pandas Timestamps to fraction of day (0.0 to 1.0).

Parameters:

dates (Series of Timestamp): Timestamps to convert.

Returns:

Series of float: Fraction of day elapsed.
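
For a single timestamp, the fraction of the day is the elapsed seconds since midnight divided by 86400. The sketch below uses a standard-library datetime instead of a pandas Series, and fraction_of_day_sketch is a hypothetical name.

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60


def fraction_of_day_sketch(ts):
    """Return how much of the day has elapsed at ts, from 0.0 up to 1.0."""
    elapsed = ts.hour * 3600 + ts.minute * 60 + ts.second + ts.microsecond / 1e6
    return elapsed / SECONDS_PER_DAY


noon = fraction_of_day_sketch(datetime(2024, 3, 15, 12, 0, 0))
six_am = fraction_of_day_sketch(datetime(2024, 3, 15, 6, 0, 0))
print(noon, six_am)
```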

matviz.etl.to_tz(cur_tz, local_tz='US/Eastern')

matviz.etl.tz_to_utc(cur_datetime, local_tz='US/Eastern', native=True)

matviz.etl.unflatten(flat_values, prototype)

matviz.etl.utc_to_tz(cur_utc, local_tz='US/Eastern')

matviz.etl.write_csv(name, array, param='w')

matviz.etl.write_csv_safe(name, array, param='w')

Write an array to a CSV file, refusing to overwrite existing files.

Parameters:

name (str): Output file path.
array (list of lists): Rows to write.
param (str, optional): File mode. Default is 'w'.

Returns:

bool: True on success.

Raises:

FileExistsError: If the file already exists.

matviz.etl.write_string(name, txt)

matviz.etl.xcorr(a, b, dt)

Normalized cross-correlation (MATLAB-style).

Parameters:

a (array-like): First signal.
b (array-like): Second signal.
dt (float): Time step between samples.

Returns:

corrs (ndarray): Normalized correlation values.
lags (ndarray): Lag values in the same units as dt.
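
As a rough illustration of normalized cross-correlation, the sketch below follows the convention of MATLAB's 'coeff' option: each lag's sum of products is divided by the square root of the product of the two signal energies. It assumes real-valued, equal-length, non-zero signals; the sign convention for lags and the name xcorr_sketch are assumptions of this sketch, not taken from the matviz source.

```python
import math


def xcorr_sketch(a, b, dt):
    """Return (corrs, lags) for lags from -(n - 1) * dt to (n - 1) * dt."""
    n = len(a)
    # Normalization: an autocorrelation equals 1.0 at zero lag.
    norm = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    corrs, lags = [], []
    for k in range(-(n - 1), n):
        total = 0.0
        for i in range(n):
            j = i + k
            if 0 <= j < n:
                total += a[j] * b[i]  # c[k] = sum over i of a[i + k] * b[i]
        corrs.append(total / norm)
        lags.append(k * dt)
    return corrs, lags


delayed = [0.0, 0.0, 1.0, 0.0]    # same pulse as reference, one sample later
reference = [0.0, 1.0, 0.0, 0.0]
corrs, lags = xcorr_sketch(delayed, reference, dt=0.5)
peak_lag = lags[corrs.index(max(corrs))]
print(peak_lag)
```

Picking the lag at the largest correlation, as in the last two lines, is the same idea that max_lag describes.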
