HappyWhale 🐳: PyTorch Training from Scratch Lite ⚡️
Learn how to write a custom training loop in pure PyTorch, create a custom torch Dataset class, compute metrics for model performance, and scale training to any hardware, such as GPU, TPU, IPU, or distributed setups, with LightningLite.
Check out the original Kaggle Notebook here.
🕵 Explore the provided data
(The EDA is taken from Jirka's notebook.)
!ls -l /kaggle/input/happy-whale-and-dolphin
PATH_DATASET = "/kaggle/input/happy-whale-and-dolphin"
import os
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
sn.set()
df_train = pd.read_csv(os.path.join(PATH_DATASET, "train.csv"))
display(df_train.head())
print(f"Dataset size: {len(df_train)}")
print(f"Unique ids: {len(df_train['individual_id'].unique())}")
Let's see how many species we have in the dataset...
counts_imgs = df_train["species"].value_counts()
counts_inds = df_train.drop_duplicates("individual_id")["species"].value_counts()
ax = pd.concat({"per Images": counts_imgs, "per Individuals": counts_inds}, axis=1).plot.barh(grid=True, figsize=(7, 10))
ax.set_xscale('log')
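To make the difference between the two counts concrete, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are hypothetical and only mirror the `species` and `individual_id` columns of `train.csv`; the two counting expressions are the same ones used above.

```python
import pandas as pd

# Hypothetical toy data, not taken from the competition files.
toy = pd.DataFrame({
    "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "species": ["beluga", "beluga", "beluga", "blue_whale", "blue_whale"],
    "individual_id": ["id1", "id1", "id2", "id3", "id3"],
})

# Every row is one image, so this counts images per species.
per_image = toy["species"].value_counts()

# Keep one row per individual first, so this counts individuals per species.
per_individual = toy.drop_duplicates("individual_id")["species"].value_counts()

# Side-by-side table, one column per counting scheme.
summary = pd.concat({"per Images": per_image, "per Individuals": per_individual}, axis=1)
print(summary)
```

In this toy table, beluga has three images but only two individuals, which is the kind of gap the bar chart makes visible for each species.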
And compare them with the unique individuals...
Note that the counts are on a log scale.
import numpy as np
from pprint import pprint
species_individuals = {}
for name, dfg in df_train.groupby("species"):
    species_individuals[name] = dfg["individual_id"].value_counts()
si_max = max(list(map(len, species_individuals.values())))
si = {n: [0] * si_max for n in species_individuals}
for n, counts in species_individuals.items():
    si[n][:len(counts)] = list(np.log(counts))
si = pd.DataFrame(si)
import seaborn as sn
fig = plt.figure(figsize=(10, 8))
ax = sn.heatmap(si[:500].T, cmap="BuGn", ax=fig.gca())
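The heatmap input is easier to follow on a small example. The sketch below rebuilds the same zero-padded matrix of log counts from a hypothetical `toy` DataFrame; the names `per_species`, `width`, `padded`, and `matrix` are illustrative and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: beluga has two individuals, blue_whale has one.
toy = pd.DataFrame({
    "species": ["beluga"] * 6 + ["blue_whale"] * 2,
    "individual_id": ["id1", "id1", "id1", "id1", "id2", "id2", "id3", "id3"],
})

# Images per individual for each species; value_counts sorts in descending order.
per_species = {name: grp["individual_id"].value_counts() for name, grp in toy.groupby("species")}

# Pad every species with zeros up to the length of the longest one.
width = max(len(counts) for counts in per_species.values())
padded = {name: [0.0] * width for name in per_species}
for name, counts in per_species.items():
    padded[name][:len(counts)] = np.log(counts).tolist()

# One column per species, one row per individual rank.
matrix = pd.DataFrame(padded)
print(matrix.T)
```

One thing to keep in mind when reading the heatmap: an individual with a single image has a log count of 0, so it looks the same as the zero padding.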
And see the top individuals:
ax = df_train["individual_id"].value_counts(ascending=True)[-50:].plot.barh(figsize=(3, 8), grid=True) # ascending=True
nb_species = len(df_train["species"].unique())
fig, axarr = plt.subplots(ncols=5, nrows=nb_species, figsize=(12, nb_species * 2))
for i, (name, dfg) in enumerate(df_train.groupby("species")):
    axarr[i, 0].set_title(name)
    for j, (_, row) in enumerate(dfg[:5].iterrows()):
        im_path = os.path.join(PATH_DATASET, "train_images", row["image"])
        img = plt.imread(im_path)
        axarr[i, j].imshow(img)
        axarr[i, j].set_axis_off()