Higher Mathematics Notes

These pages contain supplementary notes for the lecture on Higher Mathematics in the Water Engineering Master. Notably, they contain LaTeX equations, data samples and R or Python code for the exercises.

I Linear Regression

Theory

Simple Regression - one independent variable

We assume that there is a relationship:

$$\hat{Y} = b_0+b_1 \cdot X$$

with $b_0$ the intercept and $b_1$ the slope used to estimate $\hat{Y}$. For each observation $i$ this estimate has an error

$$e_i = y_i-\hat{y_i}$$

We will now sum up all errors:

$$\sum e_i = \sum{(y_i-\hat{y_i})}$$

Inserting the regression equation into the error term gives:

$$\sum e_i = \sum {(y_i-(b_0+b_1 \cdot x_i))}$$

Instead of the simple deviations, we take the squared deviations so that positive and negative errors do not cancel out:

$$\sum{e_i}^2 = \sum {(y_i-(b_0+b_1 \cdot x_i))^2}$$

We find the minimum by taking the partial derivatives

$$\frac{\partial{\sum {(y_i-(b_0+b_1 \cdot x_i))^2}}}{\partial{b_0}}$$

and

$$\frac{\partial{\sum {(y_i-(b_0+b_1 \cdot x_i))^2}}}{\partial{b_1}}$$

Setting these partial derivatives to zero and rearranging gives:

$$b_0 = \frac{\sum {x_i}^2 \sum {y_i}-\sum {x_i}\sum {x_i y_i}}{n\sum x_i^2-(\sum x_i)^2}$$

$$b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2-(\sum x_i)^2}$$
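
The closed-form expressions can be checked directly in R; a minimal sketch using the sample data from the Practice section below (file name illustrative):

ManualCoefficients.R
x <- c(7, 8, 10, 7, 6, 10, 11, 4)
y <- c(5, 8, 18, 10, 7, 12, 16, 2)
n <- length(x)
b1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b0 <- (sum(x^2) * sum(y) - sum(x) * sum(x * y)) / (n * sum(x^2) - sum(x)^2)
c(b0, b1)
# cross-check against R's built-in linear model
coef(lm(y ~ x))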

Linearization

Many common non-linear functions can be linearized by a transformation. You then perform a linear regression on the transformed variables $u$ and $v$ and finally transform the fitted coefficients $c$ and $d$ back to the original parameters $a$ and $b$ using the backward transformations in the last column of the table below.

Equation | $u$ | $v$ | transformation | backward
$y=ax^b$ | $\ln(x)$ | $\ln(y)$ | $v=cu+d$ | $a=e^d$, $b=c$
$y=ae^{bx}$ | $x$ | $\ln(y)$ | $v=cu+d$ | $a=e^d$, $b=c$
$y=a/x+b$ | $1/x$ | $y$ | $v=cu+d$ | $a=c$, $b=d$
$y=a/(b+x)$ | $x$ | $1/y$ | $v=cu+d$ | $a=1/c$, $b=d/c$
$y=ax/(b+x)$ | $1/x$ | $1/y$ | $v=cu+d$ | $a=1/d$, $b=c/d$
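
As an illustration of the first row, a minimal sketch with synthetic data: fit $y = ax^b$ by regressing $\ln(y)$ on $\ln(x)$ and transforming back.

Linearization.R
# synthetic data following y = a*x^b with a = 2, b = 1.5 and multiplicative noise
x <- seq(1, 10, by = 0.5)
y <- 2 * x^1.5 * exp(rnorm(length(x), sd = 0.05))
u <- log(x)                # u = ln(x)
v <- log(y)                # v = ln(y)
fit <- lm(v ~ u)           # v = c*u + d
a <- exp(coef(fit)[1])     # backward transformation: a = e^d (d is the intercept)
b <- coef(fit)[2]          # backward transformation: b = c (c is the slope)
c(a, b)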

Multiple Regression - several independent variables

Often the dependent variable $Y$ depends on several independent variables $X_1$, $X_2$,…,$X_n$. This relationship can be written as:

$$\hat{Y} = b_0+b_1 \cdot X_1+b_2 \cdot X_2+b_3 \cdot X_3 + ... + b_n \cdot X_n$$

In a similar way an error function can be defined:

$$\sum{e_i}^2 = \sum {(y_i-(b_0+b_1 \cdot x_{1i}+b_2 \cdot x_{2i}+b_3 \cdot x_{3i} + ... + b_n \cdot x_{ni}))^2}$$

Choosing too many independent variables can make a multiple regression look good even though the information content of the additional variables is limited: such variables only help to fit the function but do not represent a real influence. In order to account for this effect an adjusted $R^2$ is often used [@Kronthaler2014], with $J$ the number of independent variables and $N$ the number of observations:

$$R^2_{adj}=R^2-\frac{J \cdot (1-R^2)}{N-J-1}$$
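
A minimal sketch with synthetic data (all variable names made up) illustrating a multiple regression; summary() reports both the multiple and the adjusted $R^2$:

MultipleRegression.R
set.seed(1)
x1 <- rnorm(30)
x2 <- rnorm(30)
x3 <- rnorm(30)                                       # x3 has no real influence on y
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(30, sd = 0.5)
fm <- lm(y ~ x1 + x2 + x3)
summary(fm)    # reports Multiple R-squared and Adjusted R-squared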

Quality and significance

The quality of the regression model can be indicated using the $R^2$ metric.

$$ R^2 = 1 - \frac{\sum(y_i-\hat{y_i})^2}{\sum (y_i-\bar{y})^2}$$

The numerator $\sum(y_i-\hat{y_i})^2$ accounts for the variability that is not explained by the model. The denominator $\sum (y_i-\bar{y})^2$ accounts for the total variability. The $R^2$ metric is based on the idea that the variance can be subdivided into two components, one explained by the model and one that cannot be explained by the model. It is closely linked to the concept of ANOVA (analysis of variance). The value ranges from 0 to 1. In the social sciences models with $R^2 \approx 0.3$ are often accepted, whereas in the natural sciences usually only models with $R^2 > 0.6$ are.

The regression is based on a sample. It can and needs to be tested, whether the regression function contributes to explaining the dependent variable. First we need to specify the hypotheses:

$H_0$: The regression function does not contribute to explaining the dependent variable

$H_A$: The regression function contributes to explaining the dependent variable

The test statistic is:

$$F = \frac{R^2 / J}{(1-R^2)/(N-J-1)}$$

with $J$ the number of independent variables and $N$ the number of observations.

The critical value of the F-distribution with $Fg_1=J=1$ and $Fg_2=N-J-1=6$ degrees of freedom (here $N=8$ and $J=1$) and $\alpha=0.01$ is 13.75.

If the calculated $F$ exceeds this critical value, $H_0$ (the model does not explain the dependent variable) is rejected and the model is considered valid.

The last step is to test the coefficients $b_i$ with the t-Test. For each coefficient $b_i$ and its standard deviation $s_{bi}$ the test statistic

$$t = \frac{b_i}{s_{bi}}$$

is calculated. The calculated statistic $t$ is compared to the critical value of the t-Student distribution with $\alpha=0.01$ and $N-J-1$ degrees of freedom.

Practice

Some data are given to practice the theory. The values are combined into vectors with c() and into a data frame with data.frame(). Data can be plotted with base R using the command plot().

RegressionData.R
x=c(7,8,10,7,6,10,11,4)
y=c(5,8,18,10,7,12,16,2)
d = data.frame(x,y)
plot(d$x,d$y,xlab = "x", ylab = "y", pch = 19, lwd=1.5, cex=1.5)

A linear model (lm) can be fitted easily by invoking the command lm().

FitRegression.R
fm <- lm(formula = y ~ x, data = d)
# This is the key formula for regression. lm stands for linear model
# it has the parameters formula, e.g. y ~ x (read: y is explained by x),
# and data containing the data.frame with y,x values

# The model yields a result that is stored in fm
# from fm we extract the predicted values with predict(fm)
d$predicted = predict(fm)

# We can also extract the residuals (errors) with residuals(fm)
d$residuals = residuals(fm)

# knitr::kable prints the values as a table
knitr::kable(head(d))

Residuals of the regression should be normally distributed and show no systematic pattern. This can be checked by analyzing the residuals graphically.

Residuals.R
library(ggplot2)
# ggplot2 is a more flexible plotting package; aes() stands for aesthetics and maps the
# variables to the plot
ggplot(d, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +  
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +  
# you can easily add elements, e.g. points
  geom_point() +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
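
A formal check of the normality of the residuals can be added, for instance with a Q-Q plot and the Shapiro-Wilk test (both base R); a minimal sketch:

ResidualNormality.R
qqnorm(d$residuals)          # points should follow a straight line
qqline(d$residuals)
shapiro.test(d$residuals)    # H0: the residuals are normally distributed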

The results of a linear regression can be assigned to a variable (here reslm); the command summary(reslm) shows the full results of the linear regression.

Results.R
reslm <- lm(d$y~d$x)
summary(reslm)

For a value of $R^2=0.8162$ and 8 observations, e.g. a linear regression with 1 independent variable, the test statistic is $F=\frac{0.8162/1}{(1-0.8162)/(8-1-1)} \approx 26.6$, which exceeds the critical value of 13.75, so $H_0$ is rejected.

Testing.R
FTest=(0.8162/1)/(1-0.8162)*(8-1-1)
FTest
curve(df(x, df1=1, df2=6), from=0, to=6)
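
The critical value and the p-value can also be obtained directly in R; a small addition to the script above:

qf(0.99, df1 = 1, df2 = 6)          # critical value for alpha = 0.01
1 - pf(FTest, df1 = 1, df2 = 6)     # p-value of the calculated F statistic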

Worked Example - Assignment

Download Data for Exercise

Case Study

You are given data from the Wakenitz river. Check the correlation between oxygen dissolved in water in mg/L (y, the target or dependent variable) and the water temperature in degrees Celsius (x, the explanatory or independent variable).

Data need to be loaded first. The data have been uploaded as *.xls; you can either read the file directly with readxl::read_excel(), as in the script below, or convert it to *.csv first and work with read.csv() from base R. In either case make sure the file is in your working directory.

loadData.R
library(readxl)
insitu <- read_excel("insitu.xls")
plot(insitu$TempC, insitu$O2mgL, xlab = "Temperature in C", 
   ylab = "oxygen conc. in mg/L", pch = 19, lwd=1.5, cex=1.5)

Task 1: Do a Linear Regression of temperature and oxygen concentration

The data for temperature and oxygen are contained in the file insitu.csv (or insitu.xls) and can be loaded with read.csv() (or read_excel()). Once you have a data table insitu with the variables TempC and O2mgL, you can plot them and fit a linear regression with the command lm().

  1. Determine the correlation coefficient according to Bravais-Pearson
  2. Plot the correlation matrix using the library corrplot with the command corrplot(cor(data)). The term data stands for the data.frame, which in this case is insitu
  3. Develop a linear regression and discuss whether the linear regression is significant and valid using the t-test statistic, the p-value and the standard error of the estimated slope (a minimal sketch of these steps follows below)
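
A minimal sketch of these steps, assuming the column names TempC and O2mgL from the loadData.R script above (file name illustrative):

Task1Sketch.R
library(corrplot)
cor(insitu$TempC, insitu$O2mgL)              # Bravais-Pearson correlation coefficient
corrplot(cor(insitu[c("TempC", "O2mgL")]))   # correlation plot of the two numeric columns
fit <- lm(O2mgL ~ TempC, data = insitu)      # simple linear regression
summary(fit)   # t statistic, p-value and standard error of the slope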

Task 2: Do a Multiple Linear Regression

Do a Multiple Linear Regression for several parameters (independent variables pH, temperature, chloride and nitrate concentration) on the dependent variable oxygen content. The data are contained in chemistry.csv (or chemistry.xls, see the script below) and can be loaded with read.csv() (or read_excel()). Once you have a data table chemistry with additional variables, try to explain the $O_2$ concentration in mg/L with the other variables by multiple regression. To get a first idea of the correlations you can use the command pairs(), which is part of base R. With corrplot() you can visualize the interactions.

  1. First calculate and plot the correlations between the variables using cor() and corrplot() (from the library corrplot).
  2. Then try to find a multiple regression model explaining O2 concentrations with temperature, pH, nitrate etc.
  3. Can you describe the model in your own words? What is going on, what causes oxygen to drop?
loadChemistry.R
library(corrplot)
library(readxl)
chemistry <- read_excel("chemistry.xls")
plot(chemistry$pH, chemistry$O2mgL, xlab="pH", ylab="O2 in mg/L", 
   pch = 19, lwd=1.5, cex=1.5)
# multiple scatterplots
pairs(chemistry[, c("pH", "TWasC", "O2mgL", "ClmgL","NO3mgL" )])
# For correlation plots only numerical variables can be used
# Therefore we create a subset of the numerical variables, chemistrynum,
# by removing the first and second column that contain depth and location
chemistrynum <- chemistry[c(-1,-2)]
# The correlation plot helps to identify the type of correlation
corrplot(cor(chemistrynum))
Solution.R
# Space for the solution of the Problem

II Hypothesis Testing

Test-Distributions

t-Student

One of the most important distributions for testing is the t-Student distribution. It can be evaluated using the function dt(x, DF), which returns the probability density at the value x for DF degrees of freedom. The higher the degrees of freedom, the closer the distribution gets to the normal distribution.

tStudent.R
curve(dt(x, 30), from = -5, to = 5, col = "orange", 
      xlab = "quantile", ylab = "density", lwd = 2)
curve(dt(x, 10), from = -5, to = 5, col = "red", add = TRUE, lwd = 2)
curve(dt(x, 5), from = -5, to = 5, col = "green", add = TRUE, lwd = 2)
curve(dt(x, 2), from = -5, to = 5, col = "black", add = TRUE, lwd = 2)
legend("topleft", legend = paste0("DF = ", c(2, 5, 10, 30)),
       col = c("black", "green", "red", "orange"),
       lty = 1, lwd = 2)

F-Distribution

The F-distribution is needed to test the significance of regression results.

FDist.R
curve(df(x, df1=2, df2=2), from=0, to=5)
curve(df(x, df1=5, df2=5), from=0, to=5,col = "green", add = TRUE)
curve(df(x, df1=10, df2=10), from=0, to=5,col = "red", add = TRUE)
curve(df(x, df1=30, df2=15), from=0, to=5,col = "orange", add = TRUE)
legend("topright", legend = paste0("DF = ", c(2, 5, 10, 30)),
       col = c("black", "green", "red", "orange"),
       lty = 1, lwd = 2)

$\beta$-Distribution

The $\beta$-distribution is implemented in R.

beta.R
b1  <- seq(0, 1, by = 0.02)    
by1 <- dbeta(b1, shape1 = 5, shape2 = 20) 
plot(b1, by1, type = "l")   # plot the density against b1, not against the index

$\Gamma$-Distribution

The probability density of the $\Gamma$-distribution is also implemented in R.

GammaDensity.R
#define x-values
x <- seq(0, 2, by=0.01)   
#calculate gamma density for each x-value
y <- dgamma(x, shape=5) 
#create density plot
plot(x, y, type = "l")

$\chi^2$

The $\chi^2$-distribution is important for testing nominal (categorical) data and for testing whether data follow an assumed distribution type.

chisquare.R
curve(dchisq(x, df = 2), from = 0, to = 40)
curve(dchisq(x, df = 5), from = 0, to = 40,col = "green", add = TRUE)
curve(dchisq(x, df = 10), from = 0, to = 40,col = "red", add = TRUE)
curve(dchisq(x, df = 30), from = 0, to = 40,col = "orange", add = TRUE)
legend("topleft", legend = paste0("DF = ", c(2, 5, 10, 30)),
       col = c("black", "green", "red", "orange"),
       lty = 1, lwd = 2)

Test for Difference from a Mean Value

The application of Test Theory involves 5 steps:

  1. Define your research question
  2. Define your testable hypothesis and the alternative
  3. Calculate test statistics
  4. Compare to score
  5. Decide

For the mean of a sample of size $n>30$ the normal distribution can be used.

$H_0$: The average water level of lake Ratzeburg is $6.50m$.

$H_A$: The average water level of lake Ratzeburg is not $6.50m$.

The test statistic for the mean is:

$$ z = \frac{\bar{x}-\mu}{\sigma_x}$$

  • $z$: z-value of the standard normal distribution
  • $\bar{x}$: average of the sample
  • $\mu$: average of the totality, see hypothesis $H_0$
  • $\sigma_x$: standard deviation of the sample mean, derived from the standard deviation $\sigma$ of the totality, which is known in this case

  1. We choose the significance level, here $\alpha=0.05$
  2. We decide whether the test is directed (one-sided) or not. Here it is not directed (two-sided)
  3. We split $\alpha$ between both tails: $\alpha/2=0.025$
  4. In this case the z-scores are -1.96 and +1.96
  5. We calculate the score: First we calculate the standard deviation of the sample mean with $\sigma_x = \sigma / \sqrt{n}$
  6. We then calculate the test statistic with $z=(\bar{x}-\mu)/\sigma_x$
  7. We compare and decide.

An example is given in Kronthaler (2014): p. 139
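
A minimal R sketch of these steps with made-up numbers (the water level values are for illustration only):

ZTest.R
xbar  <- 6.58                   # sample mean (made-up)
mu0   <- 6.50                   # hypothesized mean from H0
sigma <- 0.30                   # known standard deviation of the totality (made-up)
n     <- 36                     # sample size > 30
sigma_x <- sigma / sqrt(n)      # standard deviation of the sample mean
z <- (xbar - mu0) / sigma_x     # test statistic
z
qnorm(c(0.025, 0.975))          # critical values -1.96 and +1.96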

For a test with $n<30$ samples, the t-Student distribution with $n-1$ degrees of freedom should be used instead of the normal distribution.

If we do not know the standard deviation of the totality $\sigma$, we replace the standard deviation of the totality by the s.d. of the sample $s$.

$$ \sigma_{\bar{x}}=\frac{s}{\sqrt{n}}$$

The critical value is no longer taken from the table of the normal distribution but from the table of the t-Student distribution. The test statistic is hence

$$ t = \frac{\bar{x}-\mu}{\sigma_{\bar{x}}}$$

A worked example can be found in Kronthaler (2014): p. 142
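
In R, this test is available as t.test(); a minimal sketch with made-up water level data:

OneSampleT.R
wl <- c(6.61, 6.42, 6.55, 6.70, 6.48, 6.66, 6.58, 6.44, 6.63, 6.52)  # made-up sample, n < 30
t.test(wl, mu = 6.50)   # one-sample t-test against H0: mu = 6.50 m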

Mean vs. Changes

Independent samples

We assume that we have two independent samples and have determined the mean value of each. We want to test whether these mean values differ significantly or not. Example: grades of female and male students in 'statistics'. The null hypothesis reads:

$$H_0: \mu_1 - \mu_2 = 0$$ $$H_A: \mu_1-\mu_2 \ne 0$$

We want to test at a confidence level of $90\%$ (significance level $\alpha=0.1$). The test is not directed (two-sided).

The test statistic is defined as:

$$t=\frac{\bar{x}_1-\bar{x}_2}{\sigma_{\bar{x}_1\bar{x}_2}}$$

with $\bar{x}_1$ and $\bar{x}_2$ the mean values of both samples, $t$ the value of the t-statistic. $\sigma_{\bar{x}_1\bar{x}_2}$ is defined as:

$$\sigma_{\bar{x}_1 \bar{x}_2}=\sqrt{\left[\frac{(n_1-1) {s_1}^2 + (n_2-1) {s_2}^2}{n_1 + n_2 - 2}\right]\left[\frac{n_1+n_2}{n_1 \times n_2}\right]}$$

with $n_1$ and $n_2$ number of observations of each group and $s_1$ and $s_2$ the respective standard deviations.

Plugging this in gives:

$$t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\left[\frac{(n_1-1) {s_1}^2 + (n_2-1) {s_2}^2}{n_1 + n_2 - 2}\right]\left[\frac{n_1+n_2}{n_1 \times n_2}\right]}}$$

The test is checked with $n_1+n_2-2$ degrees of freedom.

An example is given in Kronthaler (2014): p. 146
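
In R this pooled-variance test corresponds to t.test() with var.equal = TRUE; a minimal sketch with made-up grades:

TwoSampleT.R
grades_f <- c(1.7, 2.0, 2.3, 1.3, 2.7, 2.0, 1.7)   # made-up grades, group 1
grades_m <- c(2.3, 2.7, 2.0, 3.0, 2.3, 3.3, 2.7)   # made-up grades, group 2
t.test(grades_f, grades_m, var.equal = TRUE, conf.level = 0.90)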

Dependent samples

This test is often used to investigate whether a measure or action has an impact on the mean, e.g. whether reducing fertilizer use lowers average nitrate concentrations. We test the values before (pre) and after (post) the action.

Test hypotheses:

$$H_0: \mu_{pre} - \mu_{post} = 0 $$ $$H_A: \mu_{pre}-\mu_{post} \ne 0$$

We want to test at a confidence level of $90\%$ (significance level $\alpha=0.1$). The test is not directed (two-sided).

The test statistic for the difference test is:

$$ t = \frac{\sum d_i}{\sqrt{\frac{n \sum {d_i}^2 - \left(\sum d_i\right)^2}{n-1}}}$$

$d_i$ is the difference between the values before and after the measure or action, $n$ is the sample size.

The value is t-distributed. Degrees of freedom are $n-1$.

An example is given in Kronthaler (2014): p. 158
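
In R this is the paired t-test; a minimal sketch with made-up nitrate concentrations before and after a measure:

PairedT.R
pre  <- c(42, 38, 45, 40, 37, 44, 41, 39)   # made-up nitrate in mg/L before the measure
post <- c(39, 36, 41, 38, 36, 40, 40, 37)   # made-up nitrate in mg/L after the measure
t.test(pre, post, paired = TRUE, conf.level = 0.90)
# equivalent manual calculation of the test statistic
d <- pre - post
n <- length(d)
sum(d) / sqrt((n * sum(d^2) - sum(d)^2) / (n - 1))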

Correlation vs. Independence

We investigate the correlation coefficient $r$ between two variables.

Test hypotheses:

$$H_0: r = 0$$ $$H_A: r \ne 0$$

We could also perform a directed (one-sided) test, for example of whether the correlation is negative:

$$H_0: r \ge 0$$ $$H_A: r < 0$$

The test statistic for correlation is based on the Bravais-Pearson correlation coefficient $r$:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x}) \cdot (y_i- \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \cdot \sum_{i=1}^{n}(y_i-\bar{y})^2}}=\frac{cov(x,y)}{\sqrt{var(x) \cdot var(y)}}$$

The test statistic we use is t-student distributed:

$$t = \frac{r \times \sqrt{n-2}}{\sqrt{1-r^2}}$$

$n$ is the number of sample pairs. We choose the critical value of the t-distribution with $n-2$ degrees of freedom.
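
In R the test is available as cor.test(); a minimal sketch, reusing the example data from the regression section:

CorrelationTest.R
x <- c(7, 8, 10, 7, 6, 10, 11, 4)
y <- c(5, 8, 18, 10, 7, 12, 16, 2)
r <- cor(x, y)                                    # Bravais-Pearson correlation coefficient
tstat <- r * sqrt(length(x) - 2) / sqrt(1 - r^2)  # manual test statistic
tstat
cor.test(x, y)                                    # the same test, done by R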

Analysis of Variance (ANOVA)

Distribution Type: Kolmogorov-Smirnov Test

III Extreme Value Distributions

Poisson

Discrete events that occur with an expected mean number of $\lambda$ per interval are *Poisson*-distributed. The probability $P(X=k)$ of observing $k$ occurrences of the random variable $X$ is defined as:

$$P(X=k) = \frac{\lambda^k}{k\,!} \cdot e^{-\lambda}$$

If a random variable is Poisson-distributed, we write $PD(\lambda)$ or $X \sim PD(\lambda)$. In this case the expected value $E(X)$ equals the variance $Var(X)$ and corresponds to $\lambda$:

$$E(X) = Var(X) = \lambda $$

It is interesting to note, and important for understanding the Poisson distribution, that it represents a limit of the Binomial distribution: it describes point-like, discrete events in a continuum, or equivalently similar events occurring in a large number of similar and extremely small compartments. We assume that the compartments are so small that at most one event occurs per compartment, independently of the occurrences in the other compartments. You can approximate the Binomial distribution by a Poisson distribution if the number of cases $n$ is larger than 50, the probability of occurrence $\theta$ is smaller than 0.1, and $n \cdot \theta \le 10$ (a small R sketch of this approximation follows after the list below).

The Poisson distribution can be applied in the following cases:

  • radioactive decay
  • defects in a surface or in a structure, as long as they do not influence each other
  • typos in a text (interesting: AI-generated text would not show that distribution, or a different $\lambda$)
  • bacteria in a suspension
  • telephone calls per time interval
  • sparse rainfall events in a desert
  • sparse floods (per year)
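
A minimal R sketch of the Poisson density and of the Binomial approximation under the conditions stated above:

PoissonExample.R
lambda <- 3
dpois(0:10, lambda)                     # P(X = k) for k = 0, 1, ..., 10
# Binomial approximation: n = 100 > 50, theta = 0.03 < 0.1, n*theta = 3 <= 10
n <- 100
theta <- 0.03
round(rbind(binomial = dbinom(0:6, n, theta),
            poisson  = dpois(0:6, n * theta)), 4)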

Hypergeometric

log-normal

EV I-III

Gumbel

Weibull

Pearson (log-III)

Pareto-Distribution

Exponential

An important distribution for life expectancies (e.g. of materials) is the exponential distribution with the density function

$$f(t) = \lambda \cdot e^{-\lambda \cdot t}$$

for $t>0$ and $f(t) = 0$ for $t<0$.

The distribution function is:

$$F(t)= 1-e^{-\lambda \cdot t}$$

for $t>0$ and $F(t)=0$ for $t<0$.

The mean value is $\mu=1/\lambda$ and the variance is $\sigma^2=1/\lambda^2$.
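
A minimal R sketch of the density and distribution function, assuming a failure rate of $\lambda = 0.5$ per year:

Exponential.R
lambda <- 0.5                          # assumed failure rate per year
curve(dexp(x, rate = lambda), from = 0, to = 10,
      xlab = "t", ylab = "f(t)")       # density function
curve(pexp(x, rate = lambda), from = 0, to = 10,
      xlab = "t", ylab = "F(t)")       # distribution function
1 / lambda                             # mean life expectancy
1 / lambda^2                           # variance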

IV Clustering

Hierarchical

Ward's Method

V Classification

Eigenvectors and Eigenvalues

Factor Analysis

Principal Component Analysis

VI Spatial Interpolation

Minimum Curvature, Splines

Inverse Distance

Kriging

Co-Kriging

VII Time Series Analysis

ACF and PCF

AR

MA

ARMA

ARIMA

Trend-Analysis

Components

Recession-Analysis

Residence-time Analysis

n-ADE-Models

Fourier Analysis

VIII Optimization

Least Squares

Maximum Likelihood

Nash-Sutcliffe

Bayesian Updating

IX Differential Equations

Ordinary Differential Equations

Euler-Scheme

Runge-Kutta-Scheme

Routing: Lakes, rivers

Cascades: Basins

Compartment Modeling

Partial Differential Equations

Solute Transport

Inverse Mixing Modeling (limsolve/solve.qp)

X Machine Learning

logistic regression

sparse regression

Support Vector Machine

k-means

decision-trees

MARS
