etl
Data wrangling utilities: smoothing, cross-correlation, JSON with complex numbers, time conversions, and more.
- matviz.etl.average_times(time_1, time_2)
Compute the midpoint between two datetime objects.
- Parameters:
- time_1 : datetime
First time.
- time_2 : datetime
Second time.
- Returns:
- datetime
Midpoint.
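The midpoint computation can be illustrated with a standalone sketch that uses only the standard library. This is not the matviz implementation, and `midpoint_sketch` is a hypothetical name chosen for the example.

```python
from datetime import datetime


def midpoint_sketch(time_1, time_2):
    """Return the instant halfway between two datetimes (illustrative only)."""
    # Two datetimes cannot be added together, so add half of their
    # difference (a timedelta) to the first one instead.
    return time_1 + (time_2 - time_1) / 2


start = datetime(2024, 1, 1, 0, 0)
end = datetime(2024, 1, 1, 2, 0)
mid = midpoint_sketch(start, end)
print(mid)
```

The argument order does not matter here, because a negative difference is halved and added just the same.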
- matviz.etl.chunks(l, n)
Yield successive n-sized chunks from a sequence.
- Parameters:
- l : sequence
Input sequence to chunk.
- n : int
Chunk size.
- Yields:
- sequence
Successive chunks of length n (last chunk may be shorter).
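A generator with the behavior described above might look like the following. It is a minimal sketch, assuming the input supports `len()` and slicing; `chunks_sketch` is a hypothetical name, not the library function.

```python
def chunks_sketch(seq, n):
    """Yield successive slices of length n; the last one may be shorter."""
    for start in range(0, len(seq), n):
        yield seq[start:start + n]


pieces = list(chunks_sketch([1, 2, 3, 4, 5], 2))
print(pieces)
```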
- matviz.etl.complex_noise(n, func=np.random.randn)
Generate random complex numbers.
- Parameters:
- n : int
Number of complex values to generate.
- func : callable, optional
Random number generator. Default is np.random.randn.
- Returns:
- ndarray
Array of n complex random values.
- matviz.etl.csvify_dict(weird_dict)
Convert a column-oriented dictionary into a CSV-like list of rows.
- Parameters:
- weird_dict : dict
Keys are column names, values are lists of column values.
- Returns:
- list of lists
First row is headers, followed by data rows.
- matviz.etl.dictify_csv(weird_array, headers=None)
Convert a CSV-like list of rows into a column-oriented dictionary.
- Parameters:
- weird_array : list of lists
Row-oriented data. The first row is used as keys unless headers is provided.
- headers : list of str, optional
Column names. If given, prepended to weird_array.
- Returns:
- dict
Keys are column names, values are lists of column values.
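The two conversions (column-oriented dict to rows, and rows back to a dict) are inverses of each other. The sketch below shows that relationship in plain Python. The function names are hypothetical, and it assumes all columns have the same length, which the reference does not state.

```python
def csvify_sketch(columns):
    """Column dict -> list of rows, with the header row first."""
    headers = list(columns)
    # zip(*...) transposes the columns into rows.
    rows = [list(row) for row in zip(*(columns[h] for h in headers))]
    return [headers] + rows


def dictify_sketch(rows, headers=None):
    """List of rows -> column dict; headers, if given, are prepended."""
    if headers is not None:
        rows = [list(headers)] + list(rows)
    keys, *data = rows
    return {key: [row[i] for row in data] for i, key in enumerate(keys)}


columns = {"a": [1, 2], "b": [3, 4]}
rows = csvify_sketch(columns)
restored = dictify_sketch(rows)
print(rows)
print(restored)
```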
- matviz.etl.drop_mostly_na(df, threshold=0.1, axis=1)
Drop columns or rows whose fraction of NA values exceeds a threshold.
- Parameters:
- df : DataFrame
Input DataFrame.
- threshold : float, optional
Drop if NA fraction exceeds this value. Default is 0.1.
- axis : {0, 1, 'rows', 'columns'}, optional
0 or 'rows' to drop rows, 1 or 'columns' to drop columns. Default is 1.
- Returns:
- DataFrame
DataFrame with sparse columns/rows removed.
- matviz.etl.dump_json(data_dict, file_path, to_indent=None)
Save a dict to JSON, encoding NumPy/complex values for round-tripping.
- Parameters:
- data_dict : dict
Data to save.
- file_path : str
Output file path.
- to_indent : int, optional
JSON indentation level. Default is None (compact).
- Returns:
- bool
True on success.
- matviz.etl.encode_floats(nums, decimals=3)
Round numeric arrays to fixed decimal places for JSON export.
- Parameters:
- nums : array-like
Numbers to round.
- decimals : int, optional
Number of decimal places. Default is 3.
- Returns:
- list
Rounded Decimal values as a list.
- matviz.etl.find_percentile(value, percentiles)
Find which percentile bin a value falls into.
- Parameters:
- value : float
The value to look up.
- percentiles : array-like
Sorted percentile boundaries.
- Returns:
- int
Index of the closest percentile.
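Read literally, "index of the closest percentile" suggests a nearest-boundary lookup. The sketch below implements that reading in plain Python. Whether matviz resolves ties or out-of-range values the same way is an assumption, and `find_percentile_sketch` is a hypothetical name.

```python
def find_percentile_sketch(value, percentiles):
    """Return the index of the boundary nearest to value."""
    # min() over indices keeps the first index on ties.
    return min(range(len(percentiles)),
               key=lambda i: abs(percentiles[i] - value))


boundaries = [10, 20, 30]
print(find_percentile_sketch(22, boundaries))
```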
- matviz.etl.first_non_zero_or_nan(x)
Return the index of the first non-zero element, or NaN if none.
- Parameters:
- x : array-like
Array to search.
- Returns:
- int or float
Index of the first non-zero element, or np.nan.
- matviz.etl.flatten(values)
Flatten nested arrays/lists into a single 1D NumPy array.
- Parameters:
- values : array-like or nested list
Nested structure of arrays to flatten.
- Returns:
- ndarray
Concatenated 1D array.
- matviz.etl.geometric_median(X, eps=1e-05)
Compute the geometric median of a set of points.
The geometric median minimizes the sum of Euclidean distances to all points, generalizing the median to two or more dimensions.
- Parameters:
- X : ndarray of shape (n, d)
Point cloud.
- eps : float, optional
Convergence threshold. Default is 1e-5.
- Returns:
- ndarray of shape (d,)
The geometric median.
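The reference does not say which algorithm is used. A common choice for this problem is Weiszfeld's iteration, sketched below in plain Python as an assumption rather than a description of matviz. The names `geometric_median_sketch` and `max_iter` are hypothetical, and points that coincide with the current estimate are simply skipped to avoid division by zero.

```python
import math


def geometric_median_sketch(points, eps=1e-5, max_iter=1000):
    """Approximate the geometric median with Weiszfeld's iteration."""
    dim = len(points[0])
    # Start from the centroid.
    estimate = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):
        numerator = [0.0] * dim
        denominator = 0.0
        for p in points:
            dist = math.dist(p, estimate)
            if dist < 1e-12:
                continue  # skip a point sitting on the estimate
            weight = 1.0 / dist
            denominator += weight
            for k in range(dim):
                numerator[k] += weight * p[k]
        if denominator == 0.0:
            return estimate  # every point coincides with the estimate
        updated = [v / denominator for v in numerator]
        # Stop once the estimate moves by less than eps.
        if math.dist(estimate, updated) < eps:
            return updated
        estimate = updated
    return estimate


square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
line = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
center = geometric_median_sketch(square)
middle = geometric_median_sketch(line)
print(center, middle)
```

For three collinear points the result approaches the middle point rather than the mean, which is what distinguishes the geometric median from the centroid.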
- matviz.etl.get_object_size(obj)
Get the size of a Python object in megabytes.
- Parameters:
- obj : object
Any Python object.
- Returns:
- str
Human-readable size string.
- matviz.etl.hex2rgb(color_input)
Convert a hex color (string or integer) to normalized RGB.
- Parameters:
- color_input : str or int
Hex string (e.g. '#FF8800') or hex integer (e.g. 0xFF8800).
- Returns:
- list of float
[r, g, b] values normalized to the range 0 to 1.
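The conversion amounts to splitting a 24-bit value into three bytes and dividing each by 255. The sketch below illustrates this with a hypothetical `hex2rgb_sketch`; it assumes six-digit colors and does not handle shorthand such as '#F80'.

```python
def hex2rgb_sketch(color_input):
    """Convert '#RRGGBB' or 0xRRGGBB to [r, g, b] floats in 0..1."""
    if isinstance(color_input, str):
        value = int(color_input.lstrip("#"), 16)
    else:
        value = int(color_input)
    red = (value >> 16) & 0xFF   # top byte
    green = (value >> 8) & 0xFF  # middle byte
    blue = value & 0xFF          # bottom byte
    return [red / 255, green / 255, blue / 255]


from_string = hex2rgb_sketch("#FF8800")
from_int = hex2rgb_sketch(0xFF8800)
print(from_string)
```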
- matviz.etl.interp_nans(t, y, t_i=None)
Interpolate over NaN gaps using PCHIP and optionally resample.
- Parameters:
- t : array-like
Time axis.
- y : array-like
Values (NaN positions are interpolated over).
- t_i : array-like, optional
New time axis for resampling. By default, one is generated automatically from the median spacing.
- Returns:
- t_i : ndarray
Interpolated time axis.
- y_i : ndarray
Interpolated values.
- matviz.etl.isdigit(s)
Check if a value is numeric, including decimal strings.
- Parameters:
- s : any
Value to test.
- Returns:
- bool
True if s is a number or a numeric string.
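Unlike `str.isdigit`, a check of this kind has to accept values such as "3.14". One way to get that behavior is to attempt a float conversion, as in the sketch below. This is an assumption about the approach, and `isdigit_sketch` is a hypothetical name; note that it also accepts strings like "1e3", "inf", and "nan", which the library function may or may not treat as numeric.

```python
def isdigit_sketch(s):
    """Return True if s can be interpreted as a float."""
    try:
        float(s)
    except (TypeError, ValueError):
        # TypeError: unsupported type such as None or a list.
        # ValueError: a string that does not parse as a number.
        return False
    return True


print(isdigit_sketch("3.14"), isdigit_sketch("abc"))
```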
- matviz.etl.load_json(file_path)
Load a JSON file, restoring any embedded NumPy/complex values.
- Parameters:
- file_path : str
Path to the JSON file.
- Returns:
- dict
Parsed data with complex numbers restored.
- matviz.etl.loads_json(json_str)
Parse a JSON string, restoring any embedded NumPy/complex values.
- Parameters:
- json_str : str
JSON string.
- Returns:
- dict
Parsed data with complex numbers restored.
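JSON has no complex type, so a round trip needs an encoding convention on the way out and a matching hook on the way in. The sketch below shows one such convention with the standard `json` module. The `"__complex__"` tag and the helper names are hypothetical; the on-disk format matviz actually uses is not documented here.

```python
import json


def encode_complex(obj):
    """json.dumps 'default' hook: represent complex numbers as tagged dicts."""
    if isinstance(obj, complex):
        return {"__complex__": True, "real": obj.real, "imag": obj.imag}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def decode_complex(mapping):
    """json.loads 'object_hook': turn tagged dicts back into complex."""
    if mapping.get("__complex__"):
        return complex(mapping["real"], mapping["imag"])
    return mapping


original = {"z": 1 + 2j, "samples": [0.5, 3 - 1j]}
text = json.dumps(original, default=encode_complex)
restored = json.loads(text, object_hook=decode_complex)
print(restored)
```

The `default` hook is only consulted for objects the encoder cannot serialize, so ordinary floats, strings, and lists pass through unchanged.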
- matviz.etl.max_lag(x1, x2, ds, max_lag_allowed=inf)
Find the lag with the highest cross-correlation.
- Parameters:
- x1 : array-like
First signal.
- x2 : array-like
Second signal.
- ds : float
Time step between samples.
- max_lag_allowed : float, optional
Maximum allowable lag. Default is infinity.
- Returns:
- max_lag_out : float
Lag at the peak correlation.
- max_corr : float
Value of the peak correlation.
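To make the idea concrete, here is a brute-force sketch that evaluates an unnormalized cross-correlation at every integer shift and keeps the best one. Several details are assumptions, not documented behavior: a positive lag here means x2 is delayed relative to x1, the correlation is a raw sum of products, and `max_lag_sketch` is a hypothetical name. The library may use a different sign convention or normalization.

```python
def max_lag_sketch(x1, x2, ds, max_lag_allowed=float("inf")):
    """Return (lag in time units, peak correlation) by exhaustive search."""
    n = len(x1)
    best_lag, best_corr = 0.0, float("-inf")
    for shift in range(-(n - 1), n):
        if abs(shift * ds) > max_lag_allowed:
            continue  # outside the allowed lag window
        # Sum of products over the overlapping part of the two signals.
        corr = sum(x1[i] * x2[i + shift]
                   for i in range(n) if 0 <= i + shift < len(x2))
        if corr > best_corr:
            best_lag, best_corr = shift * ds, corr
    return best_lag, best_corr


pulse = [0, 1, 0, 0, 0]
delayed = [0, 0, 0, 1, 0]  # same pulse, two samples later
lag, corr = max_lag_sketch(pulse, delayed, ds=0.5)
print(lag, corr)
```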
- matviz.etl.microsoft_to_timestamp(ts)
Convert a Microsoft timestamp (100-ns ticks since 1601) to a pandas Timestamp.
- Parameters:
- ts : int
Microsoft timestamp value.
- Returns:
- pandas.Timestamp
Equivalent pandas Timestamp.
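The arithmetic behind the conversion can be shown with the standard library alone. This sketch returns a naive `datetime` rather than a pandas Timestamp, assumes the epoch is 1601-01-01 00:00 UTC, and truncates to microseconds because `datetime` cannot hold 100-ns precision. `filetime_to_datetime_sketch` is a hypothetical name.

```python
from datetime import datetime, timedelta

EPOCH_1601 = datetime(1601, 1, 1)


def filetime_to_datetime_sketch(ts):
    """Convert 100-ns ticks since 1601-01-01 to a naive datetime."""
    # Ten ticks of 100 ns make one microsecond.
    return EPOCH_1601 + timedelta(microseconds=ts // 10)


# 11,644,473,600 seconds separate 1601-01-01 from 1970-01-01.
unix_epoch_ticks = 116444736000000000
converted = filetime_to_datetime_sketch(unix_epoch_ticks)
print(converted)
```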
- matviz.etl.nan_smooth(y, n=5, ens=[], ignore_nans=True)
Smooth a time series using convolution, handling NaN values gracefully.
- Parameters:
- y : array-like
Time series to smooth. Supports complex values.
- n : int or array-like, optional
If int, uses a Hanning window of length n + 2. If array-like, uses it directly as the convolution window. Default is 5.
- ens : array-like, optional
Per-point weights (same length as y). Default is ones with zeros at NaN positions.
- ignore_nans : bool, optional
If True (default), treat NaN positions as missing data.
- Returns:
- ndarray
Smoothed values, centered, same length as input.
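One way to smooth across gaps is normalized convolution: convolve the zero-filled data and the weights with the same window, then divide. The sketch below follows that idea in plain Python for real-valued input and odd n. It is an assumption about how the weights described above could be used, not the library's code, and `nan_smooth_sketch` is a hypothetical name.

```python
import math


def nan_smooth_sketch(y, n=5):
    """Hanning-window smoothing that gives NaN samples zero weight."""
    m = n + 2  # window length; the two end points of a Hanning window are zero
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / (m - 1)) for k in range(m)]
    half = m // 2
    weights = [0.0 if math.isnan(v) else 1.0 for v in y]
    filled = [0.0 if math.isnan(v) else v for v in y]
    smoothed = []
    for i in range(len(y)):
        num = den = 0.0
        for k, w in enumerate(window):
            j = i + k - half  # sample under window tap k, centered on i
            if 0 <= j < len(y):
                num += w * weights[j] * filled[j]
                den += w * weights[j]
        # Dividing by the summed weights renormalizes near gaps and edges.
        smoothed.append(num / den if den > 0 else math.nan)
    return smoothed


series = [2.0, 2.0, float("nan"), 2.0, 2.0]
result = nan_smooth_sketch(series, n=5)
print(result)
```

Because the output is a weighted average of the valid neighbors, a constant series with a gap stays constant instead of dipping toward zero at the gap.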
- matviz.etl.parse_min_sec(time_str)
Convert a minutes-and-seconds time string into seconds. Does not work for strings that include hours.
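The accepted input format is not documented. Assuming strings of the form "M:SS" or "M:SS.s", the conversion could be sketched as follows; `parse_min_sec_sketch` is a hypothetical name, and, like the function described above, it rejects strings that contain an hours field.

```python
def parse_min_sec_sketch(time_str):
    """Convert 'M:SS' or 'M:SS.s' to a number of seconds."""
    # Unpacking into exactly two parts raises ValueError for 'H:MM:SS'.
    minutes, seconds = time_str.split(":")
    return int(minutes) * 60 + float(seconds)


print(parse_min_sec_sketch("2:30"))
```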
- matviz.etl.recurse_func(my_list, my_func, stop_level=False)
Recursively apply a function at a given nesting depth.
- Parameters:
- my_list : list
A list or nested list.
- my_func : callable
Function to apply.
- stop_level : int or False, optional
Nesting level at which to apply my_func. False applies at the deepest level. Default is False.
- Returns:
- list
Same structure as my_list with my_func applied.
- matviz.etl.robust_floater(w)
Convert a value to a numeric type where possible.
- Parameters:
- w : any
Value to convert.
- Returns:
- float or numeric
null/NaN values -> np.nan
Timestamps/datetimes -> Unix timestamp (float)
Numeric strings -> float
Non-numeric strings -> np.nan
Numbers -> unchanged
Everything else -> np.nan
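The conversion rules listed above can be mirrored with standard-library types. In this sketch `math.nan` stands in for np.nan and `datetime` stands in for pandas Timestamps; those substitutions, the order of the checks, and the name `robust_floater_sketch` are assumptions made for illustration.

```python
import math
from datetime import datetime, timezone
from numbers import Number


def robust_floater_sketch(w):
    """Best-effort conversion of w to a number, NaN when impossible."""
    if w is None:
        return math.nan
    if isinstance(w, datetime):
        return w.timestamp()  # Unix timestamp as a float
    if isinstance(w, Number):
        return w  # numbers pass through unchanged
    if isinstance(w, str):
        try:
            return float(w)
        except ValueError:
            return math.nan  # non-numeric string
    return math.nan  # lists, dicts, and anything else


day_two = datetime(1970, 1, 2, tzinfo=timezone.utc)
print(robust_floater_sketch("3.5"), robust_floater_sketch(day_two))
```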
- matviz.etl.robust_mkdir(desired_dir)
Create a directory and all of its parents; do nothing if it already exists.
- Parameters:
- desired_dir : str or Path
Directory path to create.
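The same effect is available from `pathlib` in the standard library, as the sketch below shows. It is an equivalent illustration rather than the library's code, and `robust_mkdir_sketch` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def robust_mkdir_sketch(desired_dir):
    """Create desired_dir and any missing parents; no error if present."""
    Path(desired_dir).mkdir(parents=True, exist_ok=True)


base = Path(tempfile.mkdtemp())
target = base / "a" / "b" / "c"
robust_mkdir_sketch(target)
robust_mkdir_sketch(target)  # second call is a no-op
print(target.is_dir())
```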
- matviz.etl.round_time(ts, round_by='H')
Round timestamps to a given frequency.
- Parameters:
- ts : datetime or array-like
Timestamp(s) to round.
- round_by : str, optional
Pandas frequency string (e.g. 'H', 'T', 'D'). Default is 'H'.
- Returns:
- Series
Rounded timestamps.
- matviz.etl.sql(query, db)
Execute a SQL query and return results.
- Parameters:
- query : str
SQL query string.
- db : cursor
Database cursor from a connection.
- Returns:
- list
Query results.
- matviz.etl.start_and_ends(logical_array)
Find contiguous True regions in a boolean array.
- Parameters:
- logical_array : array-like of bool
Boolean array to scan.
- Returns:
- list of (int, int)
Start and end index pairs for each contiguous True region.
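A single pass over the array is enough to find the runs. The sketch below does this in plain Python; it reports the end index as inclusive (the last True position of each run), which is an assumption, since the reference does not say whether the end is inclusive or exclusive. `start_and_ends_sketch` is a hypothetical name.

```python
def start_and_ends_sketch(logical_array):
    """Return (start, end) pairs for each run of True, end inclusive."""
    regions = []
    start = None  # index where the current run began, if any
    for i, flag in enumerate(logical_array):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        # The array ended while still inside a run.
        regions.append((start, len(logical_array) - 1))
    return regions


mask = [False, True, True, False, True]
runs = start_and_ends_sketch(mask)
print(runs)
```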
- matviz.etl.timestamp_to_fraction(dates)
Convert pandas Timestamps to fraction of day (0.0 to 1.0).
- Parameters:
- dates : Series of Timestamp
Timestamps to convert.
- Returns:
- Series of float
Fraction of day elapsed.
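For a single timestamp, the fraction of the day is the number of seconds since midnight divided by 86400. The sketch below shows this for a standard-library `datetime` instead of a pandas Series; `fraction_of_day_sketch` is a hypothetical name.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400


def fraction_of_day_sketch(ts):
    """Return how far through its day a datetime falls, from 0.0 up to 1.0."""
    seconds = (ts.hour * 3600 + ts.minute * 60 + ts.second
               + ts.microsecond / 1e6)
    return seconds / SECONDS_PER_DAY


noon = fraction_of_day_sketch(datetime(2024, 1, 1, 12, 0))
print(noon)
```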
- matviz.etl.write_csv_safe(name, array, param='w')
Write an array to a CSV file, refusing to overwrite existing files.
- Parameters:
- name : str
Output file path.
- array : list of lists
Rows to write.
- param : str, optional
File mode. Default is 'w'.
- Returns:
- bool
True on success.
- Raises:
- FileExistsError
If the file already exists.
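The refuse-to-overwrite behavior can be sketched with the standard `csv` module as below. This is an illustration under assumptions: it omits the `param` file-mode argument, checks for the file before opening it, and uses the hypothetical name `write_csv_safe_sketch`.

```python
import csv
import os
import tempfile


def write_csv_safe_sketch(name, array):
    """Write rows to a new CSV file; raise if the file already exists."""
    if os.path.exists(name):
        raise FileExistsError(f"{name} already exists")
    # newline="" lets the csv module control line endings itself.
    with open(name, "w", newline="") as handle:
        csv.writer(handle).writerows(array)
    return True


path = os.path.join(tempfile.mkdtemp(), "out.csv")
first = write_csv_safe_sketch(path, [["a", "b"], [1, 2]])
try:
    write_csv_safe_sketch(path, [["c"]])
    second_refused = False
except FileExistsError:
    second_refused = True
print(first, second_refused)
```

Opening the file with mode "x" would be an alternative that makes the existence check and the creation a single atomic step.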