This article presents a method of estimating discrete choice models with a convolutional neural network (CNN), so that the predictive power of neural networks can be used in choice situations. Three results are shown through experiments: 1) a multinomial logit (MNL) model is exactly equivalent to a CNN with one convolutional layer that has only one output channel and no nonlinear activation; 2) a CNN with multiple convolutional layers, multiple output channels in each layer, and nonlinear activation can largely handle nonlinear utilities, which are nearly inevitable in practice; 3) however, such a CNN cannot solve the problems addressed by more complex discrete choice models, such as the nested logit model.

When an MNL model needs to be improved, there are generally two main directions. One is to specify the utilities better, for example by introducing nonlinear terms; the other is to replace the MNL form with a more refined model such as nested logit (NL) or mixed logit (MXL). According to the findings above, the former can be achieved by a CNN, while the latter cannot.


These experiments are conducted using three Python scripts in ‘PythonCode.zip’. ‘identical_simple_cnn_mnl.py’ shows that a simple CNN can be equivalent to MNL; ‘complex_cnn.py’ shows how well a CNN performs when the utilities are nonlinear; ‘nested_logit.py’ shows that the CNN cannot handle more complex models, such as the nested logit model.



1.  Generating the Same Results of MNL Using CNN


Neural networks have an outstanding reputation in many kinds of prediction problems; however, they are seldom used to predict individual choices. At the same time, discrete choice models generally assume linear utilities, which is problematic on many occasions and could benefit from the powerful nonlinearity of neural networks.

One might think that the use of neural networks in classification can be generalized directly to choice, since the dependent variable is discrete in both. However, it is not that easy, because the data formats differ. In a classification problem, one case usually has ONLY ONE value for each explanatory variable; for example, when judging ‘sick / not sick’ from body temperature, each patient has a single temperature. By contrast, in a choice problem with N alternatives, one case usually has N values for each explanatory variable, one per alternative. For example, when travelers choose among bus / taxi / walk according to travel time, each mode has its own travel time, giving 3 travel times. Consequently, a classification problem usually has one row per case, while an N-alternative choice problem often uses a long format in which N rows form a group that jointly represents one case. The following table shows 2 cases of mode choice, which correspond to 6 rows. Group is the case number, Alt is the name of the alternative, and Choice is the choice result: each group contains exactly one ‘1’, marking the chosen alternative, and ‘0’ marks the alternatives not chosen.

Group   Alt    Choice   TT (min)   Price (¥)
1       Bus    1        20         2
1       Taxi   0        10         20
1       Walk   0        70         0
2       Bus    0        40         5
2       Taxi   0        10         20
2       Walk   1        100        0
(Alt = alternative; TT = travel time)
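The grouping of long-format rows into one ‘picture’ per case can be sketched in a few lines of plain Python. This is only an illustration under the assumption that the table is available as tuples; the helper name `long_to_cases` is hypothetical and does not come from the original scripts.

```python
# Rows taken from the table above:
# (Group, Alt, Choice, TT in minutes, Price in yuan).
rows = [
    (1, "Bus", 1, 20, 2),
    (1, "Taxi", 0, 10, 20),
    (1, "Walk", 0, 70, 0),
    (2, "Bus", 0, 40, 5),
    (2, "Taxi", 0, 10, 20),
    (2, "Walk", 1, 100, 0),
]

def long_to_cases(rows):
    """Group long-format rows into per-case feature matrices and choice vectors."""
    features, choices = {}, {}
    for group, _alt, choice, tt, price in rows:
        features.setdefault(group, []).append([tt, price])
        choices.setdefault(group, []).append(choice)
    groups = sorted(features)
    X = [features[g] for g in groups]  # shape: (cases, NALT, NVAR)
    Y = [choices[g] for g in groups]   # shape: (cases, NALT), one-hot
    return X, Y

X, Y = long_to_cases(rows)
print(X[0])  # feature matrix of case 1
print(Y)     # one-hot choice vector of each case
```

Each element of `X` is one NALT * NVAR matrix and each element of `Y` is a one-hot vector.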


How can this kind of data be handled by a neural network? If we adopt a classical fully connected neural network, 6 input nodes are needed for the 2 variables (TT and Price), standing for ‘Bus TT’, ‘Taxi TT’, ‘Walk TT’, ‘Bus Price’, ‘Taxi Price’, and ‘Walk Price’ respectively. This should also work, but it seems a little strange, since nothing ties the weight on the travel time of one mode to the weight on the travel time of another.


As we all know, a fully connected neural network without hidden layers or nonlinear activation functions is just a linear regression. Similarly, I wondered whether there is a certain kind of neural network that is exactly equivalent to MNL.


Let’s revisit the specification of MNL. In the case above, we can use two coefficients, b(TT) and b(Price), to specify an MNL model, as shown below. Constants are omitted for simplicity.

V(Bus) = b(TT) * TT(Bus)  +  b(Price) * Price(Bus)
V(Taxi) = b(TT) * TT(Taxi)  +  b(Price) * Price(Taxi)
V(Walk) = b(TT) * TT(Walk)  +  b(Price) * Price(Walk)

P(Bus) = exp(V(Bus))  / (exp(V(Bus))  +  exp(V(Taxi))  +  exp(V(Walk)))
P(Taxi) = exp(V(Taxi))  / (exp(V(Bus))  +  exp(V(Taxi))  +  exp(V(Walk)))
P(Walk) = exp(V(Walk))  / (exp(V(Bus))  +  exp(V(Taxi))  +  exp(V(Walk)))
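As a worked illustration of these formulas, the sketch below computes utilities and probabilities for case 1 of the table. The coefficient values are hypothetical, chosen only to make the arithmetic concrete; they are not estimates from the experiments.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not estimated).
B_TT, B_PRICE = -0.1, -0.2

# Case 1 from the table: alternative -> (travel time in min, price in yuan).
case = {"Bus": (20, 2), "Taxi": (10, 20), "Walk": (70, 0)}

def mnl_probabilities(case, b_tt, b_price):
    """Return MNL choice probabilities for one case."""
    utilities = {alt: b_tt * tt + b_price * price for alt, (tt, price) in case.items()}
    v_max = max(utilities.values())  # subtract the max for numerical stability
    exp_v = {alt: math.exp(v - v_max) for alt, v in utilities.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}

probs = mnl_probabilities(case, B_TT, B_PRICE)
for alt, p in probs.items():
    print(f"P({alt}) = {p:.3f}")
```

Subtracting the largest utility before exponentiating leaves the probabilities unchanged, because the same factor cancels in numerator and denominator.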

Such specifications remind me of convolution. Calculating the utilities (V) is actually a special kind of convolution: a fixed row vector [b(TT), b(Price)] scans each alternative in turn and is multiplied with that alternative's explanatory variables. Although convolution mostly appears in image analysis, we can treat an N-row case as a picture whose height is the number of alternatives (NALT) and whose length is the number of variables (NVAR). The process then scans each row of the picture with a ‘1 * NVAR’ row-kernel and outputs a ‘NALT * 1’ column-picture, in which each element is the utility of the corresponding alternative. Finally, calculating probabilities from utilities is an ordinary softmax.
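To make the analogy concrete, here is a small sketch of a generic ‘VALID’ 2-D convolution written in plain Python (strictly a cross-correlation without kernel flipping, which is what CNN libraries compute). With a ‘1 * NVAR’ kernel it returns exactly the row-wise utilities. The function name `conv2d_valid` and the coefficient values are assumptions made for illustration.

```python
def conv2d_valid(image, kernel):
    """Plain 'VALID' 2-D cross-correlation with stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# One case as a NALT x NVAR 'picture': rows are Bus, Taxi, Walk; columns are TT, Price.
picture = [[20, 2], [10, 20], [70, 0]]
# Hypothetical coefficients [b(TT), b(Price)] used as a 1 x NVAR kernel.
kernel = [[-0.1, -0.2]]

utilities = conv2d_valid(picture, kernel)  # NALT x 1 column of utilities
direct = [[-0.1 * tt + -0.2 * price] for tt, price in picture]  # V computed row by row
print(utilities)
print(direct)
```

Because the kernel is as wide as the picture, it can only slide vertically, so the output has one column and one row per alternative.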


Following this idea, I tried to set up an MNL model using a CNN. For a choice situation with NALT alternatives and NVAR explanatory variables, the TensorFlow code for the CNN is as follows.

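import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)
# NALT, NVAR and learning_rate are assumed to be defined beforehand
X = tf.placeholder(tf.float32, (None, NALT, NVAR, 1))  # one 'picture' per case
Y = tf.placeholder(tf.float32, (None, NALT))  # one-hot observed choices
W = tf.get_variable('W', [1, NVAR, 1, 1], initializer=tf.zeros_initializer())  # 1 * NVAR kernel
Z = tf.nn.conv2d(X, W, (1, 1, 1, 1), "VALID")  # utilities, shape (None, NALT, 1, 1)
Z2 = tf.contrib.layers.flatten(Z)  # shape (None, NALT)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=Z2))
opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)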

The dimensions of the input ‘X’ say that each case is a ‘picture’ of size ‘NALT * NVAR’ with one channel. The planar size of the convolutional weight ‘W’ is ‘1 * NVAR’; its one input channel matches the one channel of ‘X’, and its one output channel means there is only one filter. There are no constants, no nonlinear activation, and no pooling layers: the whole calculation is a single linear convolution that generates ‘Z’. Flattening ‘Z’ into ‘Z2’ is purely to get the right dimensions, and the cross entropy loss is computed from the softmax of ‘Z2’ and the observation ‘Y’. Finally, minimizing the cross entropy loss is equivalent to maximizing the log likelihood.
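The equivalence between cross entropy and log likelihood can be checked numerically. With one-hot labels, the cross entropy of a case is minus the log probability of the chosen alternative, so the mean cross entropy is the negative log likelihood divided by the number of cases, and the scaling does not change the minimizer. The sketch below uses hypothetical utilities and hypothetical helper names.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mean_cross_entropy(logits, onehot):
    """Mean over cases of -sum_j y_j * log(p_j)."""
    total = 0.0
    for z, y in zip(logits, onehot):
        p = softmax(z)
        total -= sum(yj * math.log(pj) for yj, pj in zip(y, p))
    return total / len(logits)

def log_likelihood(logits, onehot):
    """Sum over cases of log P(chosen alternative)."""
    return sum(math.log(softmax(z)[y.index(1)]) for z, y in zip(logits, onehot))

# Hypothetical utilities for two cases with three alternatives each.
logits = [[-2.4, -5.0, -7.0], [-5.0, -5.0, -10.0]]
onehot = [[1, 0, 0], [0, 0, 1]]

ce = mean_cross_entropy(logits, onehot)
ll = log_likelihood(logits, onehot)
print(ce, -ll / len(logits))  # the two numbers agree up to rounding
```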


The experiments are conducted as follows. First, randomly generate data and coefficients; second, simulate the choice results according to the probabilities derived from the MNL model; third, treat the simulated choice results as the dependent variable and estimate an MNL model and a CNN model.
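The three steps can be sketched with the standard library alone. This is not the code from ‘identical_simple_cnn_mnl.py’: the sample sizes, the distributions, the step size, and the use of plain gradient descent in place of the Adam optimizer are all assumptions made to keep the example short.

```python
import math
import random

random.seed(0)
NCASE, NALT, NVAR = 300, 4, 3  # illustrative sizes, smaller than in the article

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def utilities(x_case, beta):
    """Linear utility of every alternative in one case."""
    return [sum(b * x for b, x in zip(beta, row)) for row in x_case]

# Step 1: random data and 'true' coefficients.
true_beta = [random.uniform(-1, 1) for _ in range(NVAR)]
X = [[[random.gauss(0, 1) for _ in range(NVAR)] for _ in range(NALT)] for _ in range(NCASE)]

# Step 2: simulate one choice per case from the MNL probabilities.
choice = [random.choices(range(NALT), weights=softmax(utilities(x, true_beta)))[0] for x in X]

def loss_and_grad(beta):
    """Mean cross entropy and its gradient with respect to beta."""
    loss, grad = 0.0, [0.0] * NVAR
    for x, c in zip(X, choice):
        p = softmax(utilities(x, beta))
        loss -= math.log(p[c])
        for j in range(NALT):
            resid = p[j] - (1.0 if j == c else 0.0)
            for k in range(NVAR):
                grad[k] += resid * x[j][k]
    return loss / NCASE, [g / NCASE for g in grad]

# Step 3: estimate by plain gradient descent, starting from zero coefficients.
beta = [0.0] * NVAR
initial_loss, _ = loss_and_grad(beta)
for _ in range(200):
    _, grad = loss_and_grad(beta)
    beta = [b - 0.5 * g for b, g in zip(beta, grad)]
final_loss, _ = loss_and_grad(beta)
print(initial_loss, final_loss)
print(true_beta, beta)
```

With zero starting coefficients every alternative is equally likely, so the initial mean cross entropy equals log(NALT).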

The following picture shows the estimation results of a random trial with 10 alternatives, 10 explanatory variables and 1000 choice situations. It can be seen that 1) the absolute values of the initial / final log likelihood and the initial / final cross entropy are exactly the same; 2) although the CNN model is trained for 1000 steps, it actually converges in fewer than 200 steps and barely changes afterwards; 3) the coefficients of MNL and CNN are exactly the same.


Consequently, this experiment shows that a simple CNN can function exactly as MNL under a certain specification. Moreover, each parameter of this CNN is interpretable, although in most other CNNs we do not care much about interpretation.

2.  Nonlinear Utilities: Advantage of CNN

Before going further, let’s make a small transformation and turn the above data table sideways. The alternatives are still stacked along the height dimension, but the explanatory variables, which were arranged along the length dimension before, are now turned to the depth dimension: NVAR variables become NVAR channels. Each case can then be seen as a picture of size ‘NALT * 1’ with NVAR channels, as shown in the figure below.
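In array terms this transformation is a reshape from ‘NALT * NVAR’ with 1 channel to ‘NALT * 1’ with NVAR channels. A minimal sketch with nested lists follows; the helper name `variables_to_channels` is hypothetical.

```python
def variables_to_channels(picture):
    """Turn a NALT x NVAR picture with 1 channel into a NALT x 1 picture with NVAR channels."""
    # Input layout:  picture[alt][var]        -> one value (single channel)
    # Output layout: result[alt][0][channel]  -> channel index = variable index
    return [[list(row)] for row in picture]

picture = [[20, 2], [10, 20], [70, 0]]  # rows: Bus, Taxi, Walk; columns: TT, Price
channels = variables_to_channels(picture)
print(channels)  # each alternative is now one 'pixel' with NVAR channels
```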

The purpose of this transformation is to use multiple filters in a layer more intuitively. The ‘Utility’ in the above figure is derived from one particular filter [b(TT), b(Pr)], while a convolutional layer in a typical CNN uses multiple filters. Each filter generates a ‘NALT * 1’ output picture, which can be regarded as a new feature derived from the original variables, with one value for each alternative. The output features of multiple filters are then stacked along the depth dimension. Supposing we have 4 filters, the layer learns a mapping from the original 2 features ‘Travel Time, Price’ to the 4 new features ‘Feature1, Feature2, Feature3, Feature4’. Moreover, a typical CNN contains multiple convolutional layers, so we can similarly generate ‘NextLayerFeature1, NextLayerFeature2, …, NextLayerFeatureN’ from ‘Feature1, Feature2, Feature3, Feature4’. This learning process goes on until the last layer, where a single filter generates the final score of each alternative, the ‘Utility’, which is then sent to the softmax layer to calculate probabilities. The whole network structure is as follows.


Compared with the simple CNN that is equivalent to MNL, this CNN further exploits the advantages of neural networks in 3 aspects: 1) there are multiple filters in each convolutional layer; 2) there are multiple convolutional layers, hence a simple form of deep learning; 3) each convolutional layer except the last is activated nonlinearly.


However, this CNN also differs from typical CNNs for computer vision tasks in the following aspects: 1) the size of the sample ‘picture’ stays ‘NALT * 1’, and the size of the filter stays ‘1 * 1’; 2) there is no pooling layer; 3) there are no extra padding or stride considerations. Actually, this network only uses the outward form of a CNN, and might be better described as a ‘pseudo CNN’.
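A forward pass of such a ‘pseudo CNN’ can be sketched without any deep learning library, because a ‘1 * 1’ filter simply applies the same weights to the channels of every alternative. The sketch assumes no bias terms and uses untrained random weights and a small 2 -> 4 -> 1 architecture; all of these are illustrative assumptions, not the configuration of ‘complex_cnn.py’.

```python
import math
import random

def conv1x1(pixels, filters, relu):
    """Apply the same 1*1 filters to every alternative.

    pixels:  NALT lists of in_channels values
    filters: out_channels lists of in_channels weights (one list per filter)
    """
    out = []
    for pixel in pixels:
        values = [sum(w * x for w, x in zip(f, pixel)) for f in filters]
        out.append([max(0.0, v) for v in values] if relu else values)
    return out

def pseudo_cnn_probabilities(pixels, layers):
    """Hidden 1*1 layers with ReLU, a final single linear filter, then softmax."""
    for i, filters in enumerate(layers):
        is_last = i == len(layers) - 1
        pixels = conv1x1(pixels, filters, relu=not is_last)
    utilities = [p[0] for p in pixels]  # the last layer has exactly one filter
    m = max(utilities)
    e = [math.exp(u - m) for u in utilities]
    s = sum(e)
    return [v / s for v in e]

def random_filters(n_in, n_out, rng):
    return [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
# Untrained, randomly initialised weights: 2 input channels -> 4 filters -> 1 utility.
layers = [random_filters(2, 4, rng), random_filters(4, 1, rng)]
case = [[20, 2], [10, 20], [70, 0]]  # Bus, Taxi, Walk with (TT, Price) channels
probs = pseudo_cnn_probabilities(case, layers)
print(probs)
```

Since the weights are shared across alternatives, reordering the alternatives only reorders the probabilities, which is the same property the MNL utilities have.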

Now let’s check the abilities of the ‘pseudo CNN’. One of the greatest problems of traditional discrete choice models is the linear specification of the utilities, while the real relationship may be highly nonlinear. We can deliberately construct a set of choice data with nonlinear utilities: suppose we observe 4 explanatory variables (X1, X2, X3, X4) that affect the choice through pairwise interactions, V = b1*X1*X2 + b2*X2*X3 + b3*X3*X4 + b4*X4*X1. The number of alternatives is 4, and the number of choice situations is 1000.
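The interaction utility can be written as a short function. The sketch below also exposes the four pairwise products separately, since the utility is linear in those products even though it is nonlinear in X1~X4. The input values and coefficients are made up for the arithmetic check.

```python
def interaction_features(x):
    """Map (X1, X2, X3, X4) to (X1*X2, X2*X3, X3*X4, X4*X1)."""
    n = len(x)
    return [x[i] * x[(i + 1) % n] for i in range(n)]

def nonlinear_utility(x, b):
    """V = b1*X1*X2 + b2*X2*X3 + b3*X3*X4 + b4*X4*X1."""
    return sum(bk * fk for bk, fk in zip(b, interaction_features(x)))

x = [1.0, 2.0, 3.0, 4.0]        # illustrative attribute values for one alternative
b = [1.0, 1.0, 1.0, 1.0]        # illustrative coefficients
print(interaction_features(x))  # [2.0, 6.0, 12.0, 4.0]
print(nonlinear_utility(x, b))  # 24.0
```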

We estimate 3 models. MNL1 is the MNL model that directly uses X1~X4 as independent variables. MNL2 is the MNL model that uses the transformed features Feature1~Feature4, where Feature1=X1*X2, Feature2=X2*X3, Feature3=X3*X4, Feature4=X4*X1; note that this model cannot be estimated in practice, since we would not know the true relationship between Feature1~Feature4 and X1~X4. The third is the CNN model that directly uses X1~X4; this network contains 3 convolutional layers with 20, 40 and 1 filters respectively, the first 2 layers use ReLU activation, and the last layer outputs linearly to the softmax layer. The data are split into a training set and a test set (70%:30%). The estimation results are as follows.

MNL1 performs quite poorly, which shows how serious the nonlinearity problem is. MNL2, the ‘true model’, performs similarly on the training and test data, which also shows the robustness of MNL. In contrast, the CNN has an unbelievably good fit on the training data and an extraordinarily poor performance on the test data, falling far behind random guessing. Apparently the CNN is seriously over-fitted. Nevertheless, its great performance on the training data shows the strong power of neural networks: a sufficiently complex network can fit almost anything. However, since the data are actually generated from MNL2, I believe that any fit better than that of MNL2 should not be trusted.


There are many ways to deal with over-fitting in neural networks; here we simply add L2 regularization and repeat the experiment.
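L2 regularization adds a penalty proportional to the sum of squared weights to the training loss, which discourages large weights. A minimal sketch of the regularized objective is given below; the penalty strength `lam`, the weight values, and the function names are hypothetical, and the original script may scale the penalty differently.

```python
def l2_penalty(weight_tensors, lam):
    """lam times the sum of squared weights over all (flattened) filters."""
    return lam * sum(w * w for tensor in weight_tensors for w in tensor)

def regularized_loss(cross_entropy, weight_tensors, lam):
    """Objective minimized during training: data fit plus weight penalty."""
    return cross_entropy + l2_penalty(weight_tensors, lam)

# Hypothetical flattened filter weights of two layers and a hypothetical strength.
weights = [[1.0, -2.0], [3.0]]
print(l2_penalty(weights, lam=0.1))
print(regularized_loss(0.9, weights, lam=0.1))
```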


For the new data, MNL1 and MNL2 still perform very badly and very well, respectively. The over-fitting problem of the CNN is basically solved: it falls slightly behind MNL2 on both the training and test data, but greatly outperforms MNL1. Consequently, when the utilities have nonlinear forms that are unknown to the analyst, the CNN is superior to MNL in these experiments. On the other hand, MNL has its own advantages in simplicity and interpretability, while the CNN is purely a prediction tool and prone to over-fitting.


I have also tried some other forms of nonlinear utilities, 3 of which are given in the code. For every nonlinear utility form tried, the CNN outperforms MNL.

3.  More Complex Choice Model: Inability of CNN

Seeing this performance, I further wondered whether a CNN could take the place of more complicated discrete choice models, such as nested logit (NL) and mixed logit (MXL). Suppose we have a choice problem that should be treated with NL because of correlations among alternatives, but we are not aware of this; a simple MNL would then give an unsatisfactory result. How about a CNN? Let’s test this with new experiments.


Assume that every two adjacent alternatives (Alt1 & Alt2, Alt3 & Alt4, …) are in the same nest, and that each nest has a parameter lambda representing the degree of correlation between the two alternatives in this nest. 3000 choice situations are randomly generated according to this specification, with 6 alternatives and 10 explanatory variables. 3 models, MNL, NL, and CNN, are estimated. The CNN model has the same network structure as before and also uses L2 regularization. The results are as follows.
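For reference, nested logit probabilities under this nesting can be sketched as follows. The sketch assumes one common parameterization, in which lambda = 1 means no correlation within a nest and the model reduces to MNL; ‘nested_logit.py’ may use a different convention. The utilities and lambda values are hypothetical.

```python
import math

def nested_logit_probabilities(v, nests, lambdas):
    """Nested logit choice probabilities.

    v:       utilities, one per alternative
    nests:   list of nests, each a list of alternative indices
    lambdas: one parameter per nest (1 means no within-nest correlation)
    """
    # Sum of exp(V / lambda) within each nest.
    inclusive = [sum(math.exp(v[j] / lam) for j in nest) for nest, lam in zip(nests, lambdas)]
    denom = sum(s ** lam for s, lam in zip(inclusive, lambdas))
    probs = [0.0] * len(v)
    for nest, lam, s in zip(nests, lambdas, inclusive):
        for j in nest:
            probs[j] = math.exp(v[j] / lam) * s ** (lam - 1.0) / denom
    return probs

# Six alternatives, adjacent pairs share a nest; utilities and lambdas are hypothetical.
v = [0.5, 0.2, -0.1, 0.3, 0.0, -0.4]
nests = [[0, 1], [2, 3], [4, 5]]
nl = nested_logit_probabilities(v, nests, [0.5, 0.5, 0.5])
mnl = nested_logit_probabilities(v, nests, [1.0, 1.0, 1.0])  # reduces to plain MNL
print(nl)
print(mnl)
```

The within-nest term depends on the utilities of the other alternatives in the same nest, which is a structure that a network applying identical ‘1 * 1’ filters to each alternative separately does not represent.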


First, let’s compare the true coefficients with the estimated coefficients of MNL and NL. It is clear that the NL coefficients are closer to the true settings. Although the parameters of different models should not be compared directly, the MNL results are apparently biased.


As to goodness of fit, NL, the true model, unsurprisingly has the best performance. Compared with MNL, the CNN performs slightly better on the training data and slightly worse on the test data. That is to say, the CNN has no advantage over simple MNL; it cannot handle the tree structure of NL.


I will not spend time on the more complicated MXL model, and simply believe that these refined discrete choice models cannot be replaced by this kind of CNN.



    Excuse me, could a utility expression like the one in Section 2, V = b1*X1*X2 + b2*X2*X3 + b3*X3*X4 + b4*X4*X1, actually occur? If so, what would it mean? Thanks.

      The pairwise products of X1, X2, X3, X4 are just an abstract numerical example and have no practical meaning.

