Estimating Battery Lifetime
Anton A. Nesterov | an (at) vski.sh
Version: DRAFT
Data Science has countless applications across fields, and it is hard to pick one that demonstrates it best. I selected a use case that probably anybody can relate to.
In this notebook:
Descriptive Analytics
We'll use a dimensionality reduction technique to get a better "intuitive" understanding of what happens to a battery over charge cycles.
Predictive Analytics
We'll use a linear model to predict a battery's Remaining Useful Life.
Prescriptive Analytics
We're going to identify normal operational parameters for a battery and raise an alert if something goes wrong.
Dataset Description
The Hawaii Natural Energy Institute examined 14 NMC-LCO 18650 batteries with a nominal capacity of 2.8 Ah, which were cycled over 1000 times at 25°C with a CC-CV charge rate of C/2 and a discharge rate of 1.5C.
Variables:
- Number of charge cycles
- Discharge Time (s)
- Time at 4.15V (s)
- Time Constant Current (s)
- Decrement 3.6-3.4V (s)
- Max. Voltage Discharge (V)
- Min. Voltage Charge (V)
- Charging Time (s)
- Total time (s)
- RUL: Remaining Useful Life (Target)
Initial Assumptions
Leaning on fundamental knowledge, we can intuitively assume that the Number of Charge Cycles (how often a battery has been charged) correlates with the battery's useful life; however, the other conditions, like Min/Max voltage or the decrement, may carry significant weight. We assume that the other metrics may be correlated with "Battery Manufacturing Quality" or other unknown conditions that contribute to the battery's useful life.
In fact, because the measurements were taken under lab conditions (constant temperature and charge/discharge rates), we can assume that all the metrics combined are a "reflection" of battery quality.
Exploratory Analysis
The first step in any Machine Learning effort is Exploratory Data Analysis (EDA). In this step we assess the data and identify whether we need to clean, normalize, or otherwise transform it. There are usually two outcomes of this task:
1. A clean dataset that will become part of our model.
2. Dataset assessments that help us select the type of model (whether we use a linear model or a neural network).
Exploring the data
We'll use the following libraries for EDA:
!pip install pandas tabulate
!pip install seaborn
from IPython.display import display, Markdown
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
printdf = lambda df: display(Markdown(df.to_markdown()))  # render a DataFrame as a Markdown table
The file with the battery measurements is called "Battery_RUL.csv". Let's read it and show the first rows to understand the structure:
df = pd.read_csv("Battery_RUL.csv")
printdf(df.head(3))
|    | Cycle_Index | Discharge Time (s) | Decrement 3.6-3.4V (s) | Max. Voltage Dischar. (V) | Min. Voltage Charg. (V) | Time at 4.15V (s) | Time constant current (s) | Charging time (s) | RUL |
|----|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2595.3 | 1151.49 | 3.67 | 3.211 | 5460 | 6755.01 | 10777.8 | 1112 |
| 1 | 2 | 7408.64 | 1172.51 | 4.246 | 3.22 | 5508.99 | 6762.02 | 10500.4 | 1111 |
| 2 | 3 | 7393.76 | 1112.99 | 4.249 | 3.224 | 5508.99 | 6762.02 | 10420.4 | 1110 |
Check if the dataset has "unclean data"
The data may include duplicates, missing, or poorly formatted values. Let's check whether we need to clean the dataset.
Check if the dataset has duplicates
The following code shows the number of duplicated rows.
df.duplicated().sum()
np.int64(0)
We can see that the dataset is of good quality and free of duplicates. If the dataset had a high number of duplicates we would use df.drop_duplicates().
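If duplicates had been present, a minimal cleanup sketch (not needed for this dataset) would be:
# drop exact duplicate rows and rebuild the index (hypothetical cleanup step)
df = df.drop_duplicates().reset_index(drop=True)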
Check if the dataset has missing values
The second step is to find missing values and select an appropriate method for handling them:
printdf(
df.isna().sum()
)
| | 0 |
|---|---|
| Cycle_Index | 0 |
| Discharge Time (s) | 0 |
| Decrement 3.6-3.4V (s) | 0 |
| Max. Voltage Dischar. (V) | 0 |
| Min. Voltage Charg. (V) | 0 |
| Time at 4.15V (s) | 0 |
| Time constant current (s) | 0 |
| Charging time (s) | 0 |
| RUL | 0 |
We're lucky we picked a clean dataset. We have zero missing values!
Handling duplicates and missing values is a very important task, especially for small datasets. We never skip these steps even if a dataset appears to be clean. There are a number of techniques for handling missing values; we're not covering them all here.
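For illustration only, a simple imputation sketch (median fill; not needed for this dataset, and other strategies exist):
# hypothetical handling of missing values: fill gaps with per-column medians
df = df.fillna(df.median(numeric_only=True))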
Descriptive analytics
So far we know that our dataset consists of numerical variables and it is well-formed for further analytics. Now we have to get better insights into it.
Get basic statistics
printdf(
df.describe()
)
| | Cycle_Index | Discharge Time (s) | Decrement 3.6-3.4V (s) | Max. Voltage Dischar. (V) | Min. Voltage Charg. (V) | Time at 4.15V (s) | Time constant current (s) | Charging time (s) | RUL |
|---|---|---|---|---|---|---|---|---|---|
| count | 15064 | 15064 | 15064 | 15064 | 15064 | 15064 | 15064 | 15064 | 15064 |
| mean | 556.155 | 4581.27 | 1239.78 | 3.90818 | 3.5779 | 3768.34 | 5461.27 | 10066.5 | 554.194 |
| std | 322.378 | 33144 | 15039.6 | 0.0910034 | 0.123695 | 9129.55 | 25155.8 | 26415.4 | 322.435 |
| min | 1 | 8.69 | -397646 | 3.043 | 3.022 | -113.584 | 5.98 | 5.98 | 0 |
| 25% | 271 | 1169.31 | 319.6 | 3.846 | 3.488 | 1828.88 | 2564.31 | 7841.92 | 277 |
| 50% | 560 | 1557.25 | 439.239 | 3.906 | 3.574 | 2930.2 | 3824.26 | 8320.42 | 551 |
| 75% | 833 | 1908 | 600 | 3.972 | 3.663 | 4088.33 | 5012.35 | 8763.28 | 839 |
| max | 1134 | 958320 | 406704 | 4.363 | 4.379 | 245101 | 880728 | 880728 | 1133 |
Look for correlations
Here we can test our initial assumptions. We assumed that the number of charge cycles is correlated with battery life. Let's see if that's verifiable:
printdf(
df[['Cycle_Index', 'RUL']].corr()
)
| | Cycle_Index | RUL |
|---|---|---|
| Cycle_Index | 1 | -0.999756 |
| RUL | -0.999756 | 1 |
Plotting Heatmap
sns.heatmap(df.corr(), annot=True, fmt=".2f");
sample = df[['Cycle_Index', 'Max. Voltage Dischar. (V)', 'Min. Voltage Charg. (V)', 'RUL']].sample(500)
sns.pairplot(sample, diag_kind="kde");
PCA - Describing Hidden Metrics.
One of the PCA use cases is retrieving and describing hidden metrics. This is especially valuable for businesses when we need a common indicator that shows the whole picture in one number. Principal Components can be used alongside other indicators (e.g. "KPIs") for deeper analytics.
In this case we want to find one "balanced" metric that best describes the use of the battery in a single number and correlates with battery lifetime. We may also need it later for prescriptive analytics.
PCA is the first machine learning technique we use. This is a case of unsupervised machine learning.
!pip install scikit-learn
We'll remove 'Cycle_Index' and 'RUL' because they have a high correlation, and we want to find a balanced metric applicable to any charge cycle.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scalar_pca = StandardScaler()
scaled_df_pca = scalar_pca.fit_transform(df.drop(['Cycle_Index', 'RUL'], axis=1))
pca = PCA(n_components = 3)
hidden_variables = pca.fit_transform(scaled_df_pca)
pca_data = pd.DataFrame(hidden_variables, columns=['PC1', 'PC2', 'PC3'])
printdf(
pca_data.head()
)
| | PC1 | PC2 | PC3 |
|---|---|---|---|
| 0 | 0.0379614 | -0.36169 | -0.16532 |
| 1 | 1.81084 | -4.21848 | -0.413291 |
| 2 | 1.80986 | -4.21844 | -0.414285 |
| 3 | 1.80832 | -4.22098 | -0.415139 |
| 4 | 4.56928 | -2.18918 | 0.263026 |
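As a quick check of how much information each component carries, we can look at the explained variance ratios of the fitted pca object (the exact values depend on the data, so none are shown here):
# share of the total variance captured by PC1, PC2 and PC3
print(pca.explained_variance_ratio_)
# cumulative share captured by the three components together
print(pca.explained_variance_ratio_.cumsum())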
Let's see which PC is best for describing the whole picture. For this, we'll check the correlation between the PCs and battery lifetime:
df_pca = pd.concat([pca_data, df["RUL"]], axis=1)
printdf(
df_pca.head()
)
| | PC1 | PC2 | PC3 | RUL |
|---|---|---|---|---|
| 0 | 0.0379614 | -0.36169 | -0.16532 | 1112 |
| 1 | 1.81084 | -4.21848 | -0.413291 | 1111 |
| 2 | 1.80986 | -4.21844 | -0.414285 | 1110 |
| 3 | 1.80832 | -4.22098 | -0.415139 | 1109 |
| 4 | 4.56928 | -2.18918 | 0.263026 | 1107 |
sns.heatmap(df_pca.corr(), annot=True, fmt=".2f");
On the plot we can see that the PCs are indeed correlated with the battery lifetime.
It is clear that PC2 is the winner; we'll call this indicator "Battery Utilization Score" or simply "Utilization". You can pick another name; the goal is to interpret the data and generalize it in one word.
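To support that interpretation we can inspect the PCA loadings, i.e. how strongly each original variable contributes to PC2 (a small sketch using the objects fitted above; the exact weights depend on the data):
# loadings matrix: rows are components, columns are the original (scaled) features
pca_features = df.drop(['Cycle_Index', 'RUL'], axis=1).columns
loadings = pd.DataFrame(pca.components_, columns=pca_features, index=['PC1', 'PC2', 'PC3'])
# contribution of each variable to the "Utilization" component
printdf(loadings.loc[['PC2']].T)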
Normalizing the Usage Metric
Now let's see how Utilization correlates with the battery lifetime:
df_utilization = df_pca[['PC2']].rename(columns={"PC2": "Utilization"})
sns.lineplot(pd.concat([df_utilization, df["RUL"]], axis=1), x="RUL", y="Utilization");
It is clear that this component is negatively correlated with RUL (lower is better). Let's normalize it for better interpretation.
In this case it is enough to simply invert the sign:
df_utilization = df_utilization.apply(lambda x: x.mul(-1), axis=0)
sns.lineplot(pd.concat([df_utilization, df["RUL"]], axis=1), x="RUL", y="Utilization");
On the plot we see some outliers; for example, in some cases bad Battery Utilization doesn't seem to affect the battery lifetime. Here we have to remember that we dropped the number of charge cycles and the other principal components, which may explain the outliers on this plot. The point of this metric is to generalize and describe what's happening to the battery in one word (or rather a number).
Metric Normalization (for humans)
The value itself isn't normalized enough. To explain this metric to a human, we would need to normalize it further, for example, showing it as a score between 0 and 10. Let's do just that.
The first thing we notice is that the values are spread between -20 and 5, and most of them are in the range -2.5 to 2.5:
sns.histplot(df_utilization);
In this case an adequate normalization technique is clipping followed by scaling:
# clip outliers to the -2.5..2.5 range, shift to 0..5, and scale to 0..10
df_utilization['Utilization'] = df_utilization['Utilization'].clip(-2.5, 2.5).add(2.5).div(5).mul(10)
# Normalization Function (for the future)
def normalize_battery_utilization(x):
value = 2.5 if x > 2.5 else -2.5 if x < -2.5 else x
return (value + 2.5) / 5.0 * 10.0
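A few example calls show how the clipping and scaling behave:
print(normalize_battery_utilization(-3.1))  # clipped to -2.5 -> 0.0
print(normalize_battery_utilization(0.0))   # middle of the range -> 5.0
print(normalize_battery_utilization(3.0))   # clipped to 2.5 -> 10.0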
sns.histplot(df_utilization);
sns.regplot(pd.concat([df_utilization, df["RUL"]], axis=1), x="RUL", y="Utilization", line_kws=dict(color="y"), marker=".", ci=99);
Now our data is generalized enough to explain to a human, so we can give a simple interpretation of the models we create next. We can say "Battery usage is high!" instead of explaining voltages, capacity, and other physics.
Building a model for a business, we would find Business Performance Indicators using similar methods.
We used PCA to generalize the dataset and interpret it better. This is a simple case of a dimensionality reduction technique; we use similar techniques on large datasets in order to optimize the models.
Selecting Models
Defining objective
We have the following goals:
- predict the remaining battery useful life (predictive)
- maximize battery life (prescriptive)
- assess battery utilization quality (prescriptive)
Initial Assumptions
Predictive Analytics
EDA has shown that most of the variables in the dataset have high correlation rates with remaining useful life (RUL) and with each other. We've also found a principal component (Utilization) which correlates with RUL and can be used to fit a simple regression.
However, the correlation matrix has shown that some variables also correlate with each other while contributing to RUL. This may suggest there is some small degree of non-linearity.
sns.heatmap(df.corr(), annot=True);
Besides neural networks, there are various approaches to capturing non-linearity when we extrapolate a target variable.
We'll cross-validate two models: Nearest Neighbour Regression (KNN) and Linear Regression (equivalent to a one-layer network with a linear activation).
KNN
A simple implementation of KNN regression is to calculate the average of the numerical target of the K nearest neighbors. Another approach uses an inverse distance weighted average of the K nearest neighbors. KNN regression uses the same distance functions as KNN classification.
Euclidean Distance
The distance formula is applied to find the nearest neighbours, so the interpretation would be: samples that are close in feature space, i.e. with a small Euclidean distance d(x, y) = sqrt(sum_i (x_i - y_i)^2), are expected to have a similar RUL.
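A minimal sketch of the idea in plain NumPy (X_known, y_known and x_query are hypothetical arrays; for the actual model we'll use scikit-learn's KNeighborsRegressor below):
import numpy as np

def knn_regress(x_query, X_known, y_known, k=3):
    # Euclidean distance from the query point to every known point
    distances = np.sqrt(((X_known - x_query) ** 2).sum(axis=1))
    # indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # KNN regression: average the targets of the k nearest neighbours
    return y_known[nearest].mean()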
A similar technique can be applied in combination with a vector database.
Linear Regression
The output of linear regression for one independent variable is represented as Y = a + bX, where a is the y-intercept (bias), b is the slope (weight), and X is the independent variable.
A neuron output with a linear activation function is practically the same: one weight and one bias applied to the input.
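As a small illustration of that equivalence (a sketch on the Cycle_Index / RUL pair from this dataset): fitting a one-variable linear regression recovers exactly one bias and one weight, which is what a single linear neuron computes.
from sklearn.linear_model import LinearRegression

single_neuron = LinearRegression()
single_neuron.fit(df[['Cycle_Index']], df['RUL'])
# Y = a + bX: one bias (a) and one weight (b), i.e. a single linear "neuron"
print("bias a:", single_neuron.intercept_)
print("weight b:", single_neuron.coef_[0])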
Prescriptive Analytics
Usually prescriptive analytics is discussed in terms of maximizing profits. Formally, prescriptive analytics deals with optimization problems in general. In this case we can try to maximize battery life and check whether battery utilization quality is within the norm.
Previously we observed that battery life has a high correlation with the number of charge cycles. We've also discovered a "hidden metric", Utilization, which describes the battery utilization score. Let's see how those two correlate:
sns.regplot(pd.concat([df_utilization, df["Cycle_Index"]], axis=1), x="Cycle_Index", y="Utilization", line_kws=dict(color="y"), marker=".", ci=99);
This graph can be interpreted as "the fewer charging cycles have passed, the higher the Battery Utilization Score".
Another thing this plot tells us is what "normal battery utilization" would be; for example, we can see that after 500 charging cycles the battery's normal utilization score would be somewhere around 6.
In other words, if the battery score is within the norm or higher, we consider the battery to be in good condition with no external factors at play (e.g. a short circuit). On the other hand, if the score is below the norm, then we may need to ring a bell because something is wrong.
Assessing overall battery quality may require an estimate taken from a sample of charging cycles. In that case we would take the angle between the normal (baseline) line and the one estimated from the sample, as sketched below.
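A possible sketch of that angle-based check (sample_cycles and sample_scores are hypothetical arrays holding the measured cycles and their Utilization scores):
import numpy as np
from scipy.stats import linregress

def slope_angle_deg(x, y):
    # angle (in degrees) of the regression line fitted through the points
    return np.degrees(np.arctan(linregress(x, y).slope))

# baseline_angle = slope_angle_deg(df['Cycle_Index'], df_utilization['Utilization'])
# sample_angle = slope_angle_deg(sample_cycles, sample_scores)
# a large abs(sample_angle - baseline_angle) would flag unusually fast degradation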
We can use the Utilization Score's regression line and standard deviation as a baseline for assessing whether a charging cycle is within normal operational parameters:
Validating Predictive Models
The following dependencies will be used with all models:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Scaled dataset
scalar = StandardScaler()
scaled_df= pd.DataFrame(scalar.fit_transform(df), columns=df.columns)
X, y = scaled_df.drop(["RUL"], axis=1), scaled_df['RUL']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
KNN Regression Model
from sklearn.neighbors import KNeighborsRegressor
knn_regressor = KNeighborsRegressor(n_neighbors=2, weights='distance', p=2)
knn_regressor.fit(X_train, y_train)
knn_y_pred = knn_regressor.predict(X_test)
knn_mse = mean_squared_error(y_test, knn_y_pred)
knn_r2 = r2_score(y_test.to_numpy(), knn_y_pred)
print(f'R-squared: {knn_r2}')
print(f'MSE: {knn_mse}')
knn_result = pd.DataFrame({
"Actual": y_test.to_numpy(),
"Predicted": knn_y_pred
})
sns.scatterplot(knn_result.sample(100));
R-squared: 0.9975192799431646
MSE: 0.002471854906905174
Linear Regression
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X_train, y_train)
lr_y_pred = regr.predict(X_test)
lr_r2 = r2_score(y_test.to_numpy(), lr_y_pred)
lr_mse = mean_squared_error(y_test, lr_y_pred)
print(f'R-squared: {lr_r2}')
print(f'MSE: {lr_mse}')
lr_result = pd.DataFrame({
    "Actual": y_test.to_numpy(),
    "Predicted": lr_y_pred
})
sns.scatterplot(lr_result.sample(100));
R-squared: 0.9994734493523346
MSE: 0.0005246689559264891
As we can see, the models have similar precision. I'll attribute the high precision to good dataset quality. As for model choice, I would lean in favour of KNN or similar models, because similar techniques could be applied with the clusters stored in a database.
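Since we planned to cross-validate the two models, a quick k-fold comparison (a sketch reusing the scaled features and the same hyperparameters as above) could back this up on more than one split:
from sklearn.model_selection import cross_val_score

knn_cv = cross_val_score(KNeighborsRegressor(n_neighbors=2, weights='distance', p=2), X, y, cv=5, scoring='r2')
lr_cv = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("KNN mean R-squared over 5 folds:", knn_cv.mean())
print("Linear Regression mean R-squared over 5 folds:", lr_cv.mean())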
Prescriptive Analysis
Prescriptive analysis often depends on the domain and application: a data scientist is not a physicist or a chemist and cannot decide what is right or wrong with a battery. However, we can assess whether something is different about a battery's utilization and raise an alert. In other words, based on previous observations we can tell whether a battery behaves normally or not.
prescr_df = pd.concat([df_utilization, df["Cycle_Index"]], axis=1)
printdf(prescr_df.head())
| | Utilization | Cycle_Index |
|---|---|---|
| 0 | 5.72338 | 1 |
| 1 | 10 | 2 |
| 2 | 10 | 3 |
| 3 | 10 | 4 |
| 4 | 9.37837 | 6 |
from scipy.stats import linregress
reg = linregress(df["Cycle_Index"], prescr_df['Utilization'])
baseline = reg.intercept + prescr_df['Cycle_Index'].mul(reg.slope)
baseline_df = pd.DataFrame({
"Cycle_Index": df["Cycle_Index"],
"Utilization": prescr_df['Utilization'],
"Baseline" : baseline,
"Deviation": prescr_df['Utilization'] - baseline
})
normal_utilization_std = prescr_df['Utilization'].std()
print("Utilization Standard Deviation", normal_utilization_std)
Utilization Standard Deviation 2.1520489185860763
plt.plot(baseline_df['Cycle_Index'], baseline_df['Utilization'], 'o', label='original data')
plt.plot(baseline_df['Cycle_Index'], baseline_df['Baseline'], 'r', label='Baseline')
plt.legend()
plt.show()
estimations = abs(df_utilization['Utilization'] - baseline_df['Baseline']) <= normal_utilization_std
estimations.value_counts()
True     14192
False      872
Name: count, dtype: int64
test_df = pd.DataFrame({
'Cycle_Index': [150],
'Discharge Time (s)': [7408.64],
'Decrement 3.6-3.4V (s)': [1172.512500],
'Max. Voltage Dischar. (V)': [4.246],
'Min. Voltage Charg. (V)': [3.220],
'Time at 4.15V (s)': [5508.992],
'Time constant current (s)': [6762.02],
'Charging time (s)': [10500.35],
})
def sample_quality(sample: pd.DataFrame, baseline=baseline_df):
    # project the sample onto the principal components fitted earlier (same scaler, same PCA)
    pca_df = sample.drop("Cycle_Index", axis=1)
    sample_pca = pd.DataFrame(scalar_pca.transform(pca_df), columns=pca_df.columns)
    # keep only PC2 and invert its sign, exactly as we did for the Utilization metric
    metrics = pd.DataFrame(-pca.transform(sample_pca), columns=["PC1", "PC2", "PC3"]).drop(['PC1', "PC3"], axis=1)
    # clip and rescale to the 0..10 score range
    metrics = metrics.clip(-2.5, 2.5).apply(lambda x: (x + 2.5) / 5 * 10, axis=1)
    scores = pd.DataFrame({
        "Cycle_Index": sample['Cycle_Index'],
        "Score": metrics['PC2'],
    })
    # deviation of the score from the per-cycle baseline, measured against one standard deviation
    scores["Deviation"] = scores.apply(
        lambda row: normal_utilization_std - abs(row['Score'] - baseline['Baseline'][int(row['Cycle_Index'])]),
        axis=1)
    scores['Is Normal?'] = scores.apply(lambda row: int(row['Deviation'] <= 0), axis=1)
    return scores
# is_normal_utilization = normal_utilization_std - abs(utilization - baseline_df['Baseline'][charging_cycle]) <= 0
# return is_normal_utilization
printdf(
sample_quality(test_df)
)
/home/anton/.pyenv/versions/rnd/lib/python3.13/site-packages/sklearn/utils/validation.py:2742: UserWarning: X has feature names, but PCA was fitted without feature names
  warnings.warn(
| | Cycle_Index | Score | Deviation | Is Normal? |
|---|---|---|---|---|
| 0 | 150 | 10 | -0.422654 | 1 |
These battery parameters are normal for a unit that has been through about 150 charge cycles, but if this battery had been charged only 50 times and had the same parameters, we might assume it is "degrading" too fast. For a deeper analysis we'd need to assess other variables, but one balanced metric is enough to tell us that something is wrong.