Big Data Analytics - Online Learning


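Online learning is a subfield of machine learning that lets supervised learning models scale to massive datasets. The basic idea is that we do not need to load all the data into memory to fit a model; we only need to read one instance at a time.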

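Here we show how to implement an online learning algorithm using logistic regression. As in most supervised learning algorithms, there is a cost function to be minimized. In logistic regression, the cost function is defined as: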

$$ J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] $$

where J(θ) is the cost function and h_θ(x) is the hypothesis. For logistic regression, the hypothesis is defined by the following formula:

$$ h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} $$
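The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not code from the tutorial: it uses the standard logistic form 1 / (1 + e^(-z)), and the function names and the toy data are assumptions made up for the example.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x): predicted probability that y = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

def cost(theta, X, y):
    """Average log loss J(theta) over m labelled instances (y in {0, 1})."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m

# Toy data (made up for illustration): the first component is a constant bias term.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]

print(cost([0.0, 0.0], X, y))  # all-zero weights predict 0.5 for every instance
print(cost([0.0, 2.0], X, y))  # weights that separate the toy data give a lower cost
```

With all weights at zero, every prediction is 0.5 and the cost equals log(2); weights that fit the data push the cost toward zero.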

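Now that we have defined the cost function, we need an algorithm to minimize it. The simplest algorithm for this is called stochastic gradient descent. The update rule for the weights of the logistic regression model is defined as: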

$$ \theta_j := \theta_j - \alpha (h_\theta(x) - y) x_j $$
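As a rough sketch of how this update rule supports online learning, the Python snippet below applies it to one instance at a time. The names `sgd_update` and `stream`, the toy data, and the choice of 20 passes with α = 0.5 are assumptions for illustration only; a real library such as vowpal wabbit does considerably more.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(theta, x, y, alpha):
    """Apply theta_j := theta_j - alpha * (h_theta(x) - y) * x_j for every j."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    error = h - y
    return [t - alpha * error * xi for t, xi in zip(theta, x)]

def stream(instances):
    """Yield one (x, y) pair at a time, standing in for a file read line by line."""
    for x, y in instances:
        yield x, y

# Toy data (made up): a bias term plus one feature; y is 0 or 1.
data = [([1.0, 2.0], 1), ([1.0, -1.0], 0), ([1.0, 3.0], 1), ([1.0, -2.0], 0)]

theta = [0.0, 0.0]
for _ in range(20):                      # several passes over the stream
    for x, y in stream(data):
        theta = sgd_update(theta, x, y, alpha=0.5)

print(theta)
```

Only the current instance and the weight vector are held in memory at any moment, which is what lets the approach scale to data that does not fit in RAM.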

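There are several implementations of this algorithm, but the one in the vowpal wabbit library is by far the most developed. The library allows training large-scale regression models while using only a small amount of RAM. In its creators' own words: "The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research."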

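We will work with the titanic dataset from a kaggle competition. The original data can be found in the bda/part3/vw folder. There are two files:

The training data (train_titanic.csv), and
unlabeled data for making new predictions (test_titanic.csv).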

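To convert the csv format to the vowpal wabbit input format, use the csv_to_vowpal_wabbit.py Python script. You will need Python installed for this. Navigate to the bda/part3/vw folder, open a terminal, and run the following command: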

python csv_to_vowpal_wabbit.py

Note that for this section, if you are using Windows you will need to install a Unix command line; see the cygwin website for that.
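The contents of csv_to_vowpal_wabbit.py are not reproduced here, so the following is only a sketch of what such a conversion might look like, not the author's script. It assumes the usual VW text format (`label |namespace name:value ...`), puts every feature in a namespace called `f` so that it matches the `-q ff` option used later, and uses hypothetical column names (`Survived`, `Pclass`, `Sex`, `Age`) that may differ from the real file.

```python
import csv
import io

def row_to_vw(row, label_col="Survived", namespace="f"):
    """Turn one CSV row (a dict of strings) into a single line of VW text input."""
    # Logistic loss in VW expects labels -1 / 1 rather than 0 / 1.
    label = "1" if row[label_col] == "1" else "-1"
    features = []
    for name, value in row.items():
        if name == label_col or value == "":
            continue                                   # skip the label and missing values
        try:
            float(value)
            features.append(f"{name}:{value}")         # numeric feature: name:value
        except ValueError:
            # Categorical value: emit one indicator feature.
            # Spaces, ':' and '|' are reserved in the VW format, so replace them.
            clean = value.replace(" ", "_").replace(":", "_").replace("|", "_")
            features.append(f"{name}_{clean}")
    return f"{label} |{namespace} " + " ".join(features)

# Two made-up rows standing in for train_titanic.csv
sample = io.StringIO("Survived,Pclass,Sex,Age\n0,3,male,22\n1,1,female,\n")
for row in csv.DictReader(sample):
    print(row_to_vw(row))
```

Because each row is converted independently, the same function could be applied line by line to a file of any size.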

Open a terminal in the bda/part3/vw folder and run the following command:

vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --sgd --l1 0.00000001 --l2 0.0000001 --learning_rate 0.5 --loss_function logistic

Let us break down what each argument of the vw call means.

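-f model.vw : save the model in the model.vw file so it can be used for predictions later
--binary : report loss as binary classification with -1, 1 labels
--passes 20 : the data is used 20 times to learn the weights
-c : create a cache file
-q ff : use quadratic features in the f namespace
--sgd : use the regular/classic/simple stochastic gradient descent update, i.e., non-adaptive, non-normalized, and non-invariant
--l1 --l2 : L1 and L2 norm regularization
--learning_rate 0.5 : the learning rate α as defined in the update rule formula
--loss_function logistic : use the logistic loss, matching the cost function defined above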
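To give a feel for what "quadratic features" means, here is a conceptual Python sketch that adds the product of each pair of distinct features to a feature set. It is not how VW implements `-q` (VW works on hashed features internally, and its exact pairing rules are not covered here); the function name, the `a^b` naming scheme, and the feature values are assumptions for illustration.

```python
from itertools import combinations

def quadratic_features(features):
    """Return the original features plus the product of every pair of distinct features."""
    expanded = dict(features)
    for (a, va), (b, vb) in combinations(sorted(features.items()), 2):
        expanded[f"{a}^{b}"] = va * vb
    return expanded

# Hypothetical features in namespace f; an indicator feature has value 1.
f = {"Pclass": 3.0, "Age": 22.0, "Sex_male": 1.0}
print(quadratic_features(f))
```

Interaction terms like these let a linear model capture effects that depend on two inputs jointly, at the cost of many more weights to learn.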

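The following output shows the result of running the regression model from the command line. In the results, we get the average log-loss and a small report of the algorithm's performance.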

creating quadratic features for pairs: ff
using l1 regularization = 1e-08
using l2 regularization = 1e-07

final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
using cache_file = train_titanic.vw.cache
ignoring text input in favor of cache input
num sources = 1

average    since         example   example  current  current  current
loss       last          counter   weight    label   predict  features
0.000000   0.000000          1      1.0    -1.0000   -1.0000       57
0.500000   1.000000          2      2.0     1.0000   -1.0000       57
0.250000   0.000000          4      4.0     1.0000    1.0000       57
0.375000   0.500000          8      8.0    -1.0000   -1.0000       73
0.625000   0.875000         16     16.0    -1.0000    1.0000       73
0.468750   0.312500         32     32.0    -1.0000   -1.0000       57
0.468750   0.468750         64     64.0    -1.0000    1.0000       43
0.375000   0.281250        128    128.0     1.0000   -1.0000       43
0.351562   0.328125        256    256.0     1.0000   -1.0000       43
0.359375   0.367188        512    512.0    -1.0000    1.0000       57
0.274336   0.274336       1024   1024.0    -1.0000   -1.0000       57 h
0.281938   0.289474       2048   2048.0    -1.0000   -1.0000       43 h
0.246696   0.211454       4096   4096.0    -1.0000   -1.0000       43 h
0.218922   0.191209       8192   8192.0     1.0000    1.0000       43 h

finished run
number of examples per pass = 802
passes used = 11
weighted example sum = 8822
weighted label sum = -2288
average loss = 0.179775 h
best constant = -0.530826
best constant's loss = 0.659128
total feature number = 427878
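The first two columns of the progress table are related: "since last" is the mean loss over the examples seen since the previous report, and "average loss" is the running mean over all examples so far. The Python sketch below checks this on the first six rows of the log, assuming every example has weight 1 (which the "example weight" column suggests). Rows marked `h` are left out because they follow a different accounting that is not examined here.

```python
# (example counter, average loss, loss since last report) from the first rows of the log
rows = [
    (1, 0.000000, 0.000000),
    (2, 0.500000, 1.000000),
    (4, 0.250000, 0.000000),
    (8, 0.375000, 0.500000),
    (16, 0.625000, 0.875000),
    (32, 0.468750, 0.312500),
]

def running_average(rows):
    """Rebuild the 'average loss' column from the 'since last' column."""
    averages = []
    total_loss, prev_count = 0.0, 0
    for count, _, since_last in rows:
        total_loss += since_last * (count - prev_count)  # loss over the new examples
        averages.append(total_loss / count)
        prev_count = count
    return averages

for (count, reported, _), rebuilt in zip(rows, running_average(rows)):
    print(count, reported, rebuilt)
```

For these rows the rebuilt values match the reported ones, for example (0.25 × 4 + 0.5 × 4) / 8 = 0.375 at the eighth example.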

Now we can use the model.vw we trained to generate predictions for new data.

vw -d test_titanic.vw -t -i model.vw -p predictions.txt

The predictions generated by the previous command are not normalized to fit the [0, 1] range. To do this, we apply a sigmoid transformation.

# Read the predictions
library(data.table)  # provides fread
preds = fread('vw/predictions.txt')

# Define the sigmoid function
sigmoid = function(x) {
   1 / (1 + exp(-x))
}
probs = sigmoid(preds[[1]])

# Generate class labels
preds = ifelse(probs > 0.5, 1, 0)
head(preds)
# [1] 0 1 0 0 1 0
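For readers who prefer to stay in Python after the conversion step, the same post-processing might be sketched as follows. The function name `to_labels` and the sample scores are assumptions; in practice the lines would come from predictions.txt, with the raw score taken as the first field on each line.

```python
import math

def to_labels(lines, threshold=0.5):
    """Map raw scores (one per line) to probabilities, then to 0/1 class labels."""
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                              # ignore blank lines
        score = float(line.split()[0])            # first field is the raw prediction
        prob = 1.0 / (1.0 + math.exp(-score))     # sigmoid transformation
        labels.append(1 if prob > threshold else 0)
    return labels

# Made-up scores standing in for the contents of predictions.txt
print(to_labels(["-1.3", "0.8", "-0.2"]))
```

Passing an open file object instead of the list processes the predictions one line at a time.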