## Big Data definition

**Big Data**: data sets that are too large or too complex to process with a conventional computer or conventional software.

## Example
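A table with more than 1,048,576 rows cannot be processed in MS Excel, because that is the maximum number of rows in a worksheet. Python, R and MATLAB have no fixed row limit, but they typically load the whole table into RAM, so the available memory becomes the limit.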
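To get a feeling for these limits, the sketch below checks a table against the Excel row limit and estimates its size in RAM. It assumes that every column is stored as a 64-bit float (8 bytes per value), which is a simplification; the helper name `float64_table_megabytes` is made up for this illustration. The row and column counts are those of the human milk table used later in this notebook.

```python
EXCEL_MAX_ROWS = 1_048_576  # maximum number of rows in an Excel worksheet


def float64_table_megabytes(n_rows: int, n_cols: int) -> float:
    """Rough in-memory size of an all-float64 table in MB (10^6 bytes)."""
    return n_rows * n_cols * 8 / 1e6


# Approximate shape of the human milk table (1 650 000 rows, 35 columns)
n_rows, n_cols = 1_650_000, 35

exceeds_excel = n_rows > EXCEL_MAX_ROWS
size_mb = float64_table_megabytes(n_rows, n_cols)

print(f"exceeds Excel row limit: {exceeds_excel}")
print(f"approximate size in RAM: {size_mb:.0f} MB")
```

Text columns and intermediate copies made during processing usually push the real footprint above this estimate.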

## Solution
Use a specialized environment such as the Metacentrum computers, specialized servers (Mazlík) or Google Colab!

# Practical example

## Microstructure of infant formula related to its function in 2D

* Infant formula
    * pH 7
    * DRAQ5 (far-red)
    * processed using StarDist
    * 660 000 rows, 35 columns of data (350+ MB CSV file)
    * Dataset/Results_aptamil.csv

* Human milk
    * pH 7
    * DRAQ5 (far-red)
    * processed using StarDist
    * 1 650 000 rows, 35 columns of data (150+ MB CSV file)
    * Dataset/Results_MM.csv

## Goal


1. Load both data sets
2. Clear NaN values
3. Visualize the Area distribution
4. Decide whether the data are normally distributed (parametric vs. nonparametric test)
5. Define the statistical hypothesis
6. Compare the Area distributions
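
Step 5 can also be written down formally. The following is a minimal sketch, assuming a two-sided comparison of the two Area distributions. The symbols $F_{\mathrm{MM}}$ and $F_{\mathrm{Aptamil}}$ are notation introduced here for illustration; they denote the cumulative distribution functions of Area in human milk and in infant formula.

$$
H_0:\ F_{\mathrm{MM}}(x) = F_{\mathrm{Aptamil}}(x)\ \text{for all } x
\qquad\text{vs.}\qquad
H_1:\ F_{\mathrm{MM}}(x) \neq F_{\mathrm{Aptamil}}(x)\ \text{for some } x
$$

$H_0$ is rejected when the p-value of the chosen test falls below the significance level $\alpha$.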





## Expected results from 3D measurements

![Results](res.png)

In [1]:
!pip install -q -r requirements.txt

In [2]:
import numpy as np

import pandas as pd

import bokeh.plotting
import bokeh.io

!pip install --upgrade bokeh-catplot
import bokeh_catplot

bokeh.io.output_notebook()

Collecting bokeh-catplot
  Using cached bokeh_catplot-0.1.9-py2.py3-none-any.whl (16 kB)
Collecting xarray
  Using cached xarray-2022.12.0-py3-none-any.whl (969 kB)
Collecting numba
  Using cached numba-0.56.4-cp39-cp39-win_amd64.whl (2.5 MB)
Collecting colorcet
  Using cached colorcet-3.0.1-py2.py3-none-any.whl (1.7 MB)
Collecting pyct>=0.4.4
  Using cached pyct-0.4.8-py2.py3-none-any.whl (15 kB)
Collecting llvmlite<0.40,>=0.39.0dev0
  Using cached llvmlite-0.39.1-cp39-cp39-win_amd64.whl (23.2 MB)
Installing collected packages: pyct, llvmlite, numba, colorcet, xarray, bokeh-catplot
Successfully installed bokeh-catplot-0.1.9 colorcet-3.0.1 llvmlite-0.39.1 numba-0.56.4 pyct-0.4.8 xarray-2022.12.0




In [3]:
data1Path = "Dataset/Results_MM.csv" #@param {type:"string"}

data1=pd.read_csv(data1Path)  

In [4]:
data1.head()

Unnamed: 0,Unnamed: 1,Label,Area,Mean,StdDev,Mode,Min,Max,X,Y,...,%Area,RawIntDen,Slice,FeretX,FeretY,FeretAngle,MinFeret,AR,Round,Solidity
0,1,Data1,2541,56.0,0.0,56,56,56,598.54486,170.83176,...,100,142296.0,1,593,142,98.53077,55.75403,1.06228,0.94137,0.97319
1,2,Data1,2419,96.0,0.0,96,96,96,2718.95432,624.50827,...,100,232224.0,1,2711,653,72.47443,52.97387,1.10986,0.90102,0.97032
2,3,Data1,1855,4.0,0.0,4,4,4,237.42237,661.82938,...,100,7420.0,1,224,683,56.30993,48.0,1.03663,0.96466,0.96867
3,4,Data1,2596,65.0,0.0,65,65,65,293.91371,820.1171,...,100,168740.0,1,266,807,163.66396,56.02472,1.05496,0.9479,0.9732
4,5,Data1,1409,31.0,0.0,31,31,31,2998.58375,1046.93364,...,100,43679.0,1,2989,1068,68.42869,39.0,1.15185,0.86817,0.96739
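
When a CSV file is too large to load comfortably, one option is to read only the needed columns and to read the file in chunks. The sketch below shows the idea with the `usecols` and `chunksize` parameters of `pd.read_csv`. The CSV text is a tiny made-up stand-in, not the real `Dataset/Results_MM.csv`, and the `Extra` column is hypothetical.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a large CSV file (values are made up)
csv_text = "Label,Area,Feret,AR,Extra\n" + "\n".join(
    f"Data1,{i},{i * 0.5},1.0,0" for i in range(1, 11)
)

wanted = ["Label", "Area", "Feret", "AR"]

# chunksize makes read_csv return an iterator of DataFrames (4 rows each here),
# usecols skips every column that is not listed
chunks = pd.read_csv(io.StringIO(csv_text), usecols=wanted, chunksize=4)

# Each chunk could be filtered or summarized here before it is combined
small = pd.concat(chunks, ignore_index=True)

print(small.shape)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path.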


In [5]:
data1.describe()



Unnamed: 0,Unnamed: 1,Area,Mean,StdDev,Mode,Min,Max,X,Y,XM,...,%Area,RawIntDen,Slice,FeretX,FeretY,FeretAngle,MinFeret,AR,Round,Solidity
count,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,...,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0,1615323.0
mean,807662.0,133.1084,6671.007,281.0688,6488.793,6444.779,7109.614,1440.901,1394.038,1440.899,...,100.0,463317.1,1.0,1436.129,1392.875,104.9987,10.30235,1.249056,0.8442305,0.8899525
std,466303.7,251.635,6086.05,1091.096,6101.919,6101.029,6399.855,838.3394,824.3418,838.3402,...,0.0,730258.5,0.0,838.2947,824.3501,48.22601,6.097962,0.3688458,0.1536807,0.05326279
min,1.0,12.0,1.0,0.0,1.0,1.0,1.0,3.40909,3.16667,3.40909,...,100.0,70.0,1.0,1.0,1.0,1.50744,3.0,1.0,0.11117,0.35851
25%,403831.5,37.0,1956.0,0.0,1768.0,1723.0,2069.0,716.1416,689.6124,716.1596,...,100.0,161698.0,1.0,712.0,688.0,54.46232,7.0,1.06151,0.82184,0.87097
50%,807662.0,58.0,4868.0,0.0,4595.0,4534.0,5249.0,1434.662,1368.47,1434.676,...,100.0,316970.0,1.0,1430.0,1367.0,125.5377,8.0,1.10959,0.90124,0.90141
75%,1211492.0,116.0,9709.0,0.0,9492.0,9442.0,10456.0,2148.138,2046.322,2148.138,...,100.0,553099.0,1.0,2143.0,2045.0,144.4623,11.0,1.21678,0.94205,0.92308
max,1615323.0,24696.0,32332.0,14280.12,32332.0,32332.0,32332.0,3189.167,3226.5,3189.167,...,100.0,113549600.0,1.0,3187.0,3228.0,179.1449,161.2912,8.99484,1.0,1.0


In [6]:
data2Path = "Dataset/Results_aptamil.csv" #@param {type:"string"}

data2=pd.read_csv(data2Path)  

In [7]:
data2.head()

Unnamed: 0,Unnamed: 1,Label,Area,Mean,StdDev,Mode,Min,Max,X,Y,...,%Area,RawIntDen,Slice,FeretX,FeretY,FeretAngle,MinFeret,AR,Round,Solidity
0,1,Data2,2776,15.0,0.0,15,15,15,2574.66967,80.45245,...,100,41640.0,1,2510,79,0.84876,29.09975,4.99331,0.20027,0.94857
1,2,Data2,2628,14.0,0.0,14,14,14,1015.16629,128.37938,...,100,36792.0,1,1013,92,95.04245,53.0,1.28544,0.77795,0.9701
2,3,Data2,337,13.0,0.0,13,13,13,2602.27448,103.22404,...,100,4381.0,1,2598,93,105.25512,20.78831,1.0747,0.9305,0.92837
3,4,Data2,491,16.0,0.0,16,16,16,2038.76069,290.29837,...,100,7856.0,1,2030,280,127.69424,24.0,1.05842,0.9448,0.96464
4,5,Data2,509,5.0,0.0,5,5,5,2717.95972,401.13654,...,100,2545.0,1,2706,408,33.11134,24.0,1.09402,0.91406,0.9514


In [8]:
data2.describe()



Unnamed: 0,Unnamed: 1,Area,Mean,StdDev,Mode,Min,Max,X,Y,XM,...,%Area,RawIntDen,Slice,FeretX,FeretY,FeretAngle,MinFeret,AR,Round,Solidity
count,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,...,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0,666214.0
mean,333107.5,107.186941,4778.728803,107.66811,4758.3396,4671.752471,4898.814561,1462.707549,1378.698306,1462.70868,...,100.0,296458.0,1.0,1458.731747,1377.911724,97.782395,8.929439,1.235981,0.834769,0.89003
std,192319.560456,303.942842,4594.148721,620.511333,4627.082933,4580.314488,4698.678348,820.313937,786.322358,820.31581,...,0.0,611949.4,0.0,820.405865,786.316653,47.704262,5.969685,0.278721,0.120508,0.043218
min,1.0,11.0,1.0,0.0,1.0,1.0,1.0,3.42593,3.02941,3.42593,...,100.0,29.0,1.0,1.0,1.0,0.84876,3.0,1.0,0.07888,0.37182
25%,166554.25,31.0,873.0,0.0,848.0,830.0,882.0,765.134703,708.711415,765.137985,...,100.0,45655.25,1.0,761.0,708.0,45.0,6.0,1.09164,0.79151,0.86957
50%,333107.5,46.0,3257.365065,0.0,3173.0,3072.0,3337.0,1470.0,1378.71967,1470.02227,...,100.0,174420.0,1.0,1466.0,1378.0,116.56505,7.0,1.15865,0.86308,0.89474
75%,499660.75,75.0,7718.0,0.0,7704.0,7545.0,7952.0,2163.67391,2044.148023,2163.6896,...,100.0,397574.5,1.0,2160.0,2044.0,135.0,9.0,1.263408,0.91605,0.9186
max,666214.0,58667.0,21305.0,10052.81247,21305.0,21305.0,21305.0,3139.10714,3086.32609,3139.10714,...,100.0,273126100.0,1.0,3136.0,3089.0,179.29268,359.46796,12.67709,1.0,1.0



Select columns of interest

In [9]:
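# Keep only the columns of interest; .copy() makes independent DataFrames
d1Area = data1[['Label', 'Area', 'Feret', 'AR']].copy()
d2Area = data2[['Label', 'Area', 'Feret', 'AR']].copy()

# dropna returns a new DataFrame, so the result has to be assigned back
d1Area = d1Area.dropna(how='all')
d2Area = d2Area.dropna(how='all')
d2Area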

Unnamed: 0,Label,Area,Feret,AR
0,Data2,2776,135.01481,4.99331
1,Data2,2628,68.26419,1.28544
2,Data2,337,22.80351,1.07470
3,Data2,491,27.80288,1.05842
4,Data2,509,27.45906,1.09402
...,...,...,...,...
666209,Data2,1787,62.64982,1.59116
666210,Data2,180,18.78829,1.38055
666211,Data2,45,8.94427,1.16188
666212,Data2,75,12.36932,1.28449
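
Two details of `dropna` are easy to miss: it returns a new DataFrame instead of changing the original, and `how='all'` removes only rows in which every value is missing. The toy table below (made-up values) shows the difference from `how='any'`.

```python
import numpy as np
import pandas as pd

# Toy table: row 0 is complete, row 1 has one NaN, row 2 is entirely NaN
demo = pd.DataFrame({
    "Area": [10.0, np.nan, np.nan],
    "AR": [1.1, 1.2, np.nan],
})

only_empty_rows_removed = demo.dropna(how="all")  # drops row 2 only
any_nan_removed = demo.dropna(how="any")          # drops rows 1 and 2

# demo itself is unchanged because dropna returns a new DataFrame
print(len(demo), len(only_empty_rows_removed), len(any_nan_removed))
```

Which variant is appropriate depends on whether partially filled rows should stay in the analysis.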


Rename the Label entries

In [10]:
d1Area = d1Area.replace({'Data1':'MM'})
d2Area = d2Area.replace({'Data2':'Aptamil'})

In [11]:
d1Area.head()

Unnamed: 0,Label,Area,Feret,AR
0,MM,2541,60.67125,1.06228
1,MM,2419,59.77458,1.10986
2,MM,1855,50.47772,1.03663
3,MM,2596,60.44005,1.05496
4,MM,1409,46.23851,1.15185


In [12]:
d1Area['Area'].describe() 

count    1.615323e+06
mean     1.331084e+02
std      2.516350e+02
min      1.200000e+01
25%      3.700000e+01
50%      5.800000e+01
75%      1.160000e+02
max      2.469600e+04
Name: Area, dtype: float64

In [13]:
df_median = d1Area['Area'].median()

# Take a look
df_median

58.0

In [14]:
d2Area['Area'].describe() 

count    666214.000000
mean        107.186941
std         303.942842
min          11.000000
25%          31.000000
50%          46.000000
75%          75.000000
max       58667.000000
Name: Area, dtype: float64

In [15]:
df_median = d2Area['Area'].median()

# Take a look
df_median

46.0

In [16]:
result = pd.concat([d1Area, d2Area])

#del d1Area, d2Area

In [17]:
result.head()



Unnamed: 0,Label,Area,Feret,AR
0,MM,2541,60.67125,1.06228
1,MM,2419,59.77458,1.10986
2,MM,1855,50.47772,1.03663
3,MM,2596,60.44005,1.05496
4,MM,1409,46.23851,1.15185
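
Once both groups are in one table, per-group summaries can be computed in a single call with `groupby`. The sketch below uses a small toy table; its values are only the quartiles reported above, not the real data.

```python
import pandas as pd

# Toy stand-in for the combined table (three values per group)
toy = pd.DataFrame({
    "Label": ["MM", "MM", "MM", "Aptamil", "Aptamil", "Aptamil"],
    "Area": [37, 58, 116, 31, 46, 75],
})

# Count and median of Area for each Label
summary = toy.groupby("Label")["Area"].agg(["count", "median"])

print(summary)
```

Applied to `result`, the same call would give the medians of both data sets side by side.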


In [18]:
p = bokeh_catplot.histogram(
    data=result[np.mod(np.arange(result.index.size),3)!=0],
    cats='Label',
    val='Area'
)

bokeh.io.show(p)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  _, df["__label"] = utils._source_and_labels_from_cats(df, cats)
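
The expression `np.mod(np.arange(result.index.size),3)!=0` used above builds a boolean mask that drops every third row (positions 0, 3, 6, ...) and keeps the remaining two thirds. The sketch below shows this on a toy table, together with a random alternative using `DataFrame.sample`. Taking an explicit `.copy()` of the subset is the usual way to avoid a `SettingWithCopyWarning` when the subset is modified later; whether it silences the warning shown above is an assumption that was not tested here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Area": np.arange(9)})

# True for every position that is NOT a multiple of 3
mask = np.mod(np.arange(toy.index.size), 3) != 0

# .copy() gives an independent DataFrame instead of a slice of toy
thinned = toy[mask].copy()
print(thinned["Area"].tolist())

# Alternative: a reproducible random subset of about two thirds of the rows
random_subset = toy.sample(frac=2 / 3, random_state=1)
print(len(random_subset))
```

The systematic mask is deterministic, while `sample` avoids any periodic pattern in the row order.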


In [19]:
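# Note: this strip plot is overwritten by the box plot below and is never shown.
# To overlay both, pass p=p to bokeh_catplot.box (commented out below).
p = bokeh_catplot.strip(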
    data=result[np.mod(np.arange(result.index.size),3)!=0],
    cats='Label',
    val='Area',
    horizontal=True,
    jitter=True,
    height=250
)

p = bokeh_catplot.box(
    data=result,
    cats='Label',
    val='Area',
    horizontal=True,
    whisker_caps=True,
    display_points=False,
    outlier_marker='diamond',
    #box_kwargs=dict(fill_color=None, line_color='gray'),
    #median_kwargs=dict(line_color='gray'),
    #whisker_kwargs=dict(line_color='gray'),
    # p=p,
)

bokeh.io.show(p)
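
In a box plot, points beyond the whiskers are drawn as outliers. The sketch below computes where those limits fall from the quartiles reported by `describe()` above, assuming the common Tukey convention of 1.5 × IQR. That `bokeh_catplot.box` uses exactly this rule is an assumption, and the function name `tukey_fences` is made up for this illustration.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Lower and upper outlier limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of Area taken from the describe() outputs above
mm_fences = tukey_fences(37.0, 116.0)       # human milk (MM)
aptamil_fences = tukey_fences(31.0, 75.0)   # infant formula (Aptamil)

print("MM:", mm_fences)
print("Aptamil:", aptamil_fences)
```

Under this rule, the maximum areas (24696 and 58667) lie far beyond the upper limits, which is consistent with a strongly right-skewed distribution.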

In [20]:
from scipy.stats import normaltest

k2, p = normaltest(d1Area['Area'])
alpha = 1e-3
print("p = {:g}".format(p))
print('null hypothesis: Data1 (MM) comes from a normal distribution')
if p < alpha:  # null hypothesis: Data1 (MM) comes from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")

k2, p = normaltest(d2Area['Area'])
alpha = 1e-3
print("p = {:g}".format(p))
print('null hypothesis: Data2 (Aptamil) comes from a normal distribution')
if p < alpha:  # null hypothesis: Data2 (Aptamil) comes from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")

p = 0
null hypothesis: Data1 (MM) comes from a normal distribution
The null hypothesis can be rejected
p = 0
null hypothesis: Data2 (Aptamil) comes from a normal distribution
The null hypothesis can be rejected
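
A printed p-value of exactly 0 most likely means that the true value is too small to be represented as a floating-point number. The Area summaries above also show a mean far above the median (133 vs. 58 for MM), which points to a right-skewed distribution. The sketch below repeats the test on synthetic right-skewed data. The log-normal shape and its parameters are assumptions chosen for illustration, not a claim about the real data.

```python
import numpy as np
from scipy.stats import normaltest, skew

# Synthetic right-skewed "areas" (log-normal is an assumption, see above)
rng = np.random.default_rng(0)
areas = rng.lognormal(mean=4.0, sigma=1.0, size=5000)

# D'Agostino-Pearson test, the same function as used above
k2, p_value = normaltest(areas)

print("skewness:", skew(areas))
print("p-value :", p_value)

# For comparison: the same test on the log-transformed values
log_k2, log_p = normaltest(np.log(areas))
print("p-value after log transform:", log_p)
```

With very large samples, normality tests tend to reject even for small deviations, so it helps to look at the histogram and the skewness as well.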


Both normality tests reject the null hypothesis, so a nonparametric test (Mann-Whitney U) is used to compare the two data sets:

In [21]:
# Mann-Whitney U test (nonparametric comparison of two independent samples)
from scipy.stats import mannwhitneyu

# Draw a reproducible random subsample of 100 areas from each data set
sample1 = d1Area['Area'].sample(n=100, random_state=1)
sample2 = d2Area['Area'].sample(n=100, random_state=1)
print('null hypothesis: data sets are from the same distribution')
# compare samples
stat, p = mannwhitneyu(sample1, sample2)
print('Statistics=%.3f, p=%.16f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

null hypothesis: data sets are from the same distribution
Statistics=6272.500, p=0.0018810635846991
Different distribution (reject H0)
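
The p-value says whether a difference is detectable, not how large it is. One common way to express the size is to divide the U statistic by the number of sample pairs. The sketch below does this for the result above. It assumes that the reported statistic is the U value of the first sample (MM), which is how recent SciPy versions document `mannwhitneyu`; the variable names are chosen for this illustration.

```python
# Values taken from the test output above
u_stat = 6272.5
n1 = n2 = 100  # size of each subsample

# Estimated probability that a random MM area exceeds a random Aptamil area
# (ties count as one half)
cles = u_stat / (n1 * n2)

# Rank-biserial correlation: -1 to 1, with 0 meaning no difference
rank_biserial = 2 * cles - 1

print(f"common-language effect size: {cles:.3f}")
print(f"rank-biserial correlation  : {rank_biserial:.3f}")
```

A value of about 0.63 suggests a moderate shift toward larger areas in human milk in this subsample of 100 + 100 objects; other subsamples may give somewhat different values.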


In [23]:
from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,scipy,pandas,matplotlib,bokeh,statannotations"))

Last updated: 2023-01-05T13:42:29.700464+01:00

Python implementation: CPython
Python version       : 3.9.15
IPython version      : 8.8.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
CPU cores   : 40
Architecture: 64bit

watermark      : 2.3.1
numpy          : 1.23.5
scipy          : 1.10.0
pandas         : 1.5.2
matplotlib     : 3.6.2
bokeh          : 3.0.3
statannotations: 0.5.0

