Workshop on Big Data and Machine Learning

Thursday - Saturday, April 11 - 13, 2019

Maxim Doucet Hall

Department of Mathematics

University of Louisiana at Lafayette, Lafayette, Louisiana

This workshop will focus on recent developments related to big data and machine learning.

Big Data

Professor Erniel B. Barrios
School of Statistics
University of the Philippines Diliman, Philippines

There will be four sessions on big data.

  1. Topics in High Dimensional Data
  2. Models of Customer Survival Data
  3. Analyzing Multiple Time Series Data
  4. Some Open Topics

Machine Learning

Professor Gerhard Dikta
University of Applied Science
FH Aachen, Germany

There will be two sessions on machine learning.

  1. Probability of damage to electronic systems due to indirect lightning strikes
    A real application example I did years ago for the German insurance industry.
  2. Bootstrap approximations in model checks
    New model validation approaches that can be used in statistical learning.

Registration (required)

Register here. There is no registration fee. However, to aid in planning, please complete the registration form as soon as possible. The registration deadline is Monday, 8 April 2019.

Schedule

Thursday Afternoon, 11 April 2019
Maxim Doucet Hall room 211

Time Topic and speaker
4:30 - 4:45 Refreshments
4:45 - 6:00 Probability of damage to electronic systems due to indirect lightning strikes
Gerhard Dikta

Friday Afternoon, 12 April 2019
Maxim Doucet Hall room 211

Time Topic and speaker
12:45 - 2:00 Topics in High Dimensional Data
Erniel B. Barrios
2:00 - 2:30 Refreshments
2:30 - 3:45 Bootstrap approximations to check parametric regression models
Gerhard Dikta
3:45 - 4:15 Refreshments
4:15 - 5:30 Models of Customer Survival Data
Erniel B. Barrios

Saturday Morning, 13 April 2019
Maxim Doucet Hall room 211

Time Topic and speaker
8:45 - 9:00 Refreshments
9:00 - 10:15 Analyzing Multiple Time Series Data
Erniel B. Barrios
10:15 - 10:45 Refreshments
10:45 - 12:00 Some Open Topics
Erniel B. Barrios

Abstracts

Probability of damage to electronic systems due to indirect lightning strikes
Gerhard Dikta
11 April 2019

German household insurance covers damage to an electronic system if the damage is caused by a lightning strike. In the years 2002-2005, insurance companies observed a sharp increase in claims of this kind. In response to this increase, the German Insurance Association (GDV) supported a project aimed at analyzing the distance between a lightning strike and the location where the damage occurred. In this lecture, a model for the distribution of these distances is discussed and applied to real data from the insurance companies. The modelling is based on about 75,000 damage reports from the year 2005.

Topics in High Dimensional Data
Erniel Barrios
12 April 2019

The data-generating process that results in big data is often characterized by a complex dependence structure. The data exhibit heterogeneity as a result of pooling data from different sources. Representing such data requires a large number of variables (features), and the data are therefore often labelled high dimensional. Two approaches to dealing with high dimensional data will be discussed. The first is dimension reduction, in which high dimensional features are translated into lower dimensions; a method that accounts for data characteristics arising from the heterogeneity of pooled data will be discussed. The second approach does away with dimension reduction as a pre-analysis step prior to modeling and instead develops a model that selects features of the data while simultaneously fitting a predictive model. This method is then applied to a quality of life index.

Bootstrap approximations to check parametric regression models
Gerhard Dikta
12 April 2019

Suppose we observe a series of binary data along with explanatory variables and we suspect that these observations belong to a parametric regression model. To verify this assumption, we use Kolmogorov-Smirnov and Cramér-von Mises type tests based on a maximum likelihood estimate of the parameter and a marked empirical process introduced by Stute. We determine the critical values for the tests with a special bootstrap procedure in which the resampling scheme is adapted to the parametric setup. The approach is discussed in the context of machine learning, including how it can be applied to generalized linear models, distinguishing between semi-parametric and parametric GLMs. Finally, the approach is applied to simulated and real data. In the latter case, we review parametric model assumptions for some right-censored data.
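For readers new to this kind of model check, the following is a minimal sketch of the general idea, not the speaker's implementation. It assumes a logistic model with a single covariate, a Kolmogorov-Smirnov type statistic built from the cumulative sum of residuals ordered by the covariate, and a model-based bootstrap that keeps the covariates fixed and redraws the binary responses from the fitted model. All function names (`fit_logistic`, `ks_statistic`, `bootstrap_pvalue`) are hypothetical and chosen for illustration only.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=25):
    """Maximum likelihood fit of P(Y=1|x) = sigmoid(a + b*x) by Newton-Raphson."""
    a = b = 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # score with respect to a
            g1 += (y - p) * x      # score with respect to b
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:            # information matrix numerically singular
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


def ks_statistic(xs, ys, a, b):
    """Sup of |n^(-1/2) * sum of residuals with x_i <= x| over observed x.

    Assumes no ties among the covariate values.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    cum = 0.0
    sup = 0.0
    for i in order:
        cum += ys[i] - sigmoid(a + b * xs[i])
        sup = max(sup, abs(cum))
    return sup / math.sqrt(len(xs))


def bootstrap_pvalue(xs, ys, n_boot=200, seed=0):
    """Model-based bootstrap: fixed covariates, responses redrawn from the fit."""
    rng = random.Random(seed)
    a, b = fit_logistic(xs, ys)
    observed = ks_statistic(xs, ys, a, b)
    exceed = 0
    for _ in range(n_boot):
        ys_star = [1 if rng.random() < sigmoid(a + b * x) else 0 for x in xs]
        a_star, b_star = fit_logistic(xs, ys_star)  # re-estimate on each resample
        if ks_statistic(xs, ys_star, a_star, b_star) >= observed:
            exceed += 1
    return observed, exceed / n_boot


if __name__ == "__main__":
    # Simulated data that do follow the assumed logistic model.
    rng = random.Random(42)
    xs = [rng.gauss(0.0, 1.0) for _ in range(300)]
    ys = [1 if rng.random() < sigmoid(0.5 + x) else 0 for x in xs]
    stat, p = bootstrap_pvalue(xs, ys)
    print("KS statistic:", stat, "bootstrap p-value:", p)
```

A small bootstrap p-value would suggest that the parametric model does not describe the data well. The sketch does not cover the Cramér-von Mises variant, multiple covariates, or the right-censored setting mentioned in the abstract.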

Models of Customer Survival Data
Erniel Barrios
12 April 2019

In a highly competitive sector like the telecommunications industry, recruitment of new customers is more expensive than strategies that induce loyalty and patronage among existing customers. Survival models are used to characterize customer behavior; the models are then used in Customer Lifetime Valuation (CLV), which in turn informs the planning of loyalty incentives and offers. Various features of the data-generating process stimulated the development of the new statistical methods to be discussed.

Analyzing Multiple Time Series Data
Erniel Barrios
13 April 2019

In addition to the telecommunications industry, credit card transactions, bank accounts, and financial markets also contributed to the early evolution of big data. Data from these sectors are often characterized by multiple time series. Multiple time series are distinguished from multivariate time series and from panel data. An estimation procedure for models of multiple time series data is proposed. Statistical methods are also developed to support the analysis of other features of multiple time series data.

Some Open Topics
Erniel Barrios
13 April 2019

There are many themes describing features of big data. Some common topics of interest include varying frequencies and clustering. We present some statistical problems formulated from these topics, along with initial results generated so far and some open problems. Features of multiple time series are further extended to changepoint analysis and to the clustering of time series. Some initial work on text mining will also be discussed.

Information

Please direct any inquiries to Nabendu Pal or Bruce Wade.