Translation (52): Splitting a Dataset into Training and Test Sets in PyTorch
If you spot any translation issues, feel free to point them out in the comments. Thanks.
How do I split a custom dataset into training and test datasets?
nirvair asked:
import pandas as pd
import numpy as np
import cv2
from torch.utils.data.dataset import Dataset

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        pixels = self.data['pixels'].tolist()
        faces = []
        for pixel_sequence in pixels:
            face = [int(pixel) for pixel in pixel_sequence.split(' ')]
            # print(np.asarray(face).shape)
            face = np.asarray(face).reshape(self.width, self.height)
            face = cv2.resize(face.astype('uint8'), (self.width, self.height))
            faces.append(face.astype('float32'))
        faces = np.asarray(faces)
        faces = np.expand_dims(faces, -1)
        return faces, self.labels

    def __len__(self):
        return len(self.data)
This is what I managed to put together from references in other repositories. However, I also want to split this dataset into training and test sets.
Can I do that inside this class, or do I need a separate class for it?
Answers:
Fábio Perez - vote: 156
Starting with PyTorch 0.4.1 you can use random_split:

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(
    full_dataset, [train_size, test_size])
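In recent PyTorch versions, random_split also accepts a generator argument, which makes the split reproducible. A minimal sketch, assuming full_dataset is any Dataset (for example an instance of the asker's CustomDatasetFromCSV) and that your PyTorch version supports the generator keyword:

import torch
from torch.utils.data import DataLoader

# Seed the generator so the same split is produced on every run
train_dataset, test_dataset = torch.utils.data.random_split(
    full_dataset, [train_size, test_size],
    generator=torch.Generator().manual_seed(42))

# The resulting subsets are regular Datasets and plug straight into DataLoader
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)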
benjaminplanche - vote: 127
Using PyTorch's SubsetRandomSampler:

import torch
import pandas as pd
import numpy as np
import cv2
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.dataset import Dataset
from torch.utils.data.sampler import SubsetRandomSampler

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        # This method should return only 1 sample and label
        # (according to "index"), not the whole dataset
        # So probably something like this for you:
        pixel_sequence = self.data['pixels'][index]
        face = [int(pixel) for pixel in pixel_sequence.split(' ')]
        face = np.asarray(face).reshape(self.width, self.height)
        face = cv2.resize(face.astype('uint8'), (self.width, self.height))
        label = self.labels[index]

        return face, label

    def __len__(self):
        return len(self.labels)


dataset = CustomDatasetFromCSV(my_path)
batch_size = 16
validation_split = .2
shuffle_dataset = True
random_seed = 42

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                           sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                sampler=valid_sampler)

# Usage example:
num_epochs = 10
for epoch in range(num_epochs):
    # Train:
    for batch_index, (faces, labels) in enumerate(train_loader):
        ...
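On newer PyTorch versions, the same index lists can also be wrapped with torch.utils.data.Subset instead of samplers, which leaves shuffling to the DataLoader itself. A minimal sketch, reusing dataset, train_indices, and val_indices from the snippet above:

from torch.utils.data import Subset, DataLoader

# Freeze the index lists into fixed subsets; shuffling is then handled
# by the training DataLoader rather than by a sampler
train_set = Subset(dataset, train_indices)
val_set = Subset(dataset, val_indices)

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
validation_loader = DataLoader(val_set, batch_size=16, shuffle=False)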
Shital Shah - vote: 25
The existing answers do random splits, which has the disadvantage that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want only a small number of samples per class. For example, MNIST has 60,000 examples, i.e. 6,000 per digit. Assume that you want only 30 examples per digit in your training set. A random split may then produce an imbalance between classes (one digit ending up with more training data than the others), so you want to make sure each digit has exactly 30 labels. This is called stratified sampling.
Translator's note: the original explanation is fairly long and took me a moment to digest. In short, random sampling draws a given number of samples from the whole dataset, while stratified sampling draws a given number of samples from each class and then combines them into the training set.
One way to do this is to use the sampler interface in PyTorch, and sample code is here.
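The linked sample code is not reproduced here, but the idea behind a sampler-based stratified split can be sketched as follows. This is an illustration of the approach, not the linked code; it assumes labels is an array of integer class ids aligned with the dataset indices, and the helper name stratified_indices is made up for this sketch:

import numpy as np
from torch.utils.data.sampler import SubsetRandomSampler

def stratified_indices(labels, k, seed=42):
    # Pick k indices per class for training; everything else becomes test data
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        train_idx.extend(idx[:k].tolist())
        test_idx.extend(idx[k:].tolist())
    return train_idx, test_idx

# The index lists then feed SubsetRandomSampler exactly as in the answer above:
# train_sampler = SubsetRandomSampler(train_idx)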
Another way is to just hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed per class.

import torch
from torch.utils.data import TensorDataset

def sampleFromClass(ds, k):
    class_counts = {}
    train_data = []
    train_label = []
    test_data = []
    test_label = []
    for data, label in ds:
        c = label.item()
        class_counts[c] = class_counts.get(c, 0) + 1
        if class_counts[c] <= k:
            train_data.append(data)
            train_label.append(torch.unsqueeze(label, 0))
        else:
            test_data.append(data)
            test_label.append(torch.unsqueeze(label, 0))
    train_data = torch.cat(train_data)
    train_label = torch.cat(train_label)
    test_data = torch.cat(test_data)
    test_label = torch.cat(test_label)
    return (TensorDataset(train_data, train_label),
            TensorDataset(test_data, test_label))
You can use this function like this:

from torchvision import datasets, transforms

def main():
    train_ds = datasets.MNIST('../data', train=True, download=True,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ]))
    train_ds, test_ds = sampleFromClass(train_ds, 3)
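As a quick sanity check, you can count the labels in the returned training set and confirm that every class holds exactly k samples. A small sketch, assuming train_ds is the stratified TensorDataset returned above:

from collections import Counter

# Each item of the TensorDataset is a (data, label) pair of tensors
counts = Counter(int(label) for _, label in train_ds)
print(counts)  # with k=3, every digit should map to 3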