Thursday, November 18, 2021

when reporting the confusion matrix, can change the cut off to an arbituray number, and the confusion matrix will be changed accordingly, when changing the cut off, the false positive rate and false negative rate will change accordingly.

dat <- iris dat$positive <- as.factor(ifelse(dat$Species == "setosa", "s", "ns")) library(caret) mod <- train(positive~Sepal.Length, data=dat, method="glm")

confusionMatrix(table(predict(mod, type="prob")[,"s"] >= 0.25,
                      dat$positive == "s"))
# Confusion Matrix and Statistics
# 
#        
#         FALSE TRUE
#   FALSE    88    3
#   TRUE     12   47
#                                           
#                Accuracy : 0.9             
#                  95% CI : (0.8404, 0.9429)
#     No Information Rate : 0.6667          
#     P-Value [Acc > NIR] : 2.439e-11       
#                                           
#                   Kappa : 0.7847          
#  Mcnemar's Test P-Value : 0.03887         
#                                           
#             Sensitivity : 0.8800          
#             Specificity : 0.9400          
#          Pos Pred Value : 0.9670          
#          Neg Pred Value : 0.7966          
#              Prevalence : 0.6667          
#          Detection Rate : 0.5867          
#    Detection Prevalence : 0.6067          
#       Balanced Accuracy : 0.9100

Saturday, July 31, 2021

an interesting website for data analysis

https://www.digitaling.com/

it is in chinese,

internet industry technical terms

advertising related

1. ad network

2. ad exchange

3. RBT real time bidding

4. DSP demand side platform

5. DMP data-management platform

6. programmatic buying

7. Private market place

8. Programmatic Direct buy

9. Premium Inventory

10. Remnant Inventory

11. CPM cost per mille

12. CPC cost per click, Cost Per Thousand;Cost Per Impressions

13. CPC (Cost Per Click;Cost Per Thousand Click-Through)

14. CPA(Cost-per-Action)

15. CPS(Cost-Per-Sale)

16. CPT cost per time

17. CPV cost per visit

18. CPI cost per visit

19. CPD cost per download

20. banner

21 Interstitial

22. Native Advertising (Native Ads)

Operation related

23. AARRR ：Acquisition、Activation、Retention、Revenue、Refer

24. DNU(Daily New Users)

25. CAC（Customer Acquisition Cost）

26. CPC (Cost Per Customer )

27. CR (Conversions Rates)

28. DAU(Daily Active Users)

29. WAU(Weekly Active Users)

30. MAU(Monthly Active Users)

31. DEC(Daily Engagement Count)

32. DAOT/AT(Daily Avg.Online Time)

33. DAU (Daily Active User)

34. MAU (Monthly active users)

35.Users Retention

36. Day 1/3/7/30 Retention Ratio

37. Users Churn

38. Day 1 Churn Ratio

39. Day 7 Churn Ratio

40. Day 30 Churn Ratio

41. MPR(Monthly Payment Ratio)

42. MAU, APA

43. APA(Active Payment Account)

44. ARPU(Average Revenue per Uers)

45. ARPU

46. monthly ARPU= /MAU

47.ARPPU(Average Revenue per Paying User)

48. ARPPU

monthly ARPPU=

49. life time

50 life time value

51 PCU(Peak Concurrent Users)

52. ACU(Average Concurrent Users)

53. New Users Converstion Rate

54. SEO(Seach Engine Optimization)

55. SEM (Search Engine Marketing)

56. ASO (App Store Optimization)

57. KPI(Key performance indicators)

58. GMV(Gross Merchandise Voltume )

59. SKU (Stock Keeping Unit)

60. ‎Long Tail Keyword

61. MVP(Minimum Viable Product )

62. SP (Service Provider)

63. CP（Content Provider

64. BD (Business Development)

65. SDK (Software Development Kit)

66. UE/UE(User Experience)

67. EDM (Email Direct Marketing)

68. SNS (Social Networking Services)

69. UGC (User Generated Content)

70. PGC(Professional Generated Content)

71. OGC(Occupationally-generated Content)

72. KOL(Key Opinion Leader)

Friday, July 23, 2021

CS50 a fabulous class

https://www.youtube.com/c/cs50/videos

good for beginners

Wednesday, July 21, 2021

time series, how to choose lag

Tuesday, July 13, 2021

statistics knowledge websites

https://stattrek.com/

https://www.khanacademy.org/math/statistics-probability

https://brilliant.org/

https://www.datacamp.com/community/tutorials

statistical knowledge

1. how to normalize data

2. how to detect outlier, what is IQR?

3. how to reverse a list in python

4. how to insert a number in a list in python

5. how does spark's rdd work? how is it diffrent from pyspark's dataframe?

6. how to calculate cumulative sums in a table in sql

7. what is the difference between mapreduce and in-memory?

8. what is mapreduce?

9. what is lag?

10. proceeding and in sql?

11. how to count number of data points in a numpy array?

12. how to do hyperthesis test?

13. what is false positive rate? what is false negative rate?

14. how to delete duplicates in a dataframe in python?

15. what is false discovery rate? and bonferroni correction?

Monday, July 12, 2021

what is false discovery rate

what is false discovery rate?

https://www.youtube.com/watch?v=3PVkfQRUGI4

an interesting video talking about it.

it is something we predefined in a hyperthesis testing. a type one erro for the multiple testing we tried to control.

a concise video about bonferroni correction

https://www.youtube.com/watch?v=HLzS5wPqWR0

to understand bonferroni correction, first , we need to understand family-wise error rate,

a1=type one error

FWER=1-(1-a1)^m

m is the number of tests

bonferroni correction

corrected a1

=a1/k

k is the number of tests performed.

FWER=1-(1-a1/k)^k

Saturday, July 10, 2021

false discovery rate , how to calculate , The Bonferroni correction

https://support.sas.com/resources/papers/proceedings/proceedings/sugi31/190-31.pdf

https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf

Thursday, July 1, 2021

two interesting youtube blogs for data analytics

testing theory

https://www.youtube.com/c/TestingTheory/videos

data interview pro

https://www.youtube.com/c/DataInterviewPro/videos

Wednesday, June 23, 2021

how to prepare for coding interviews from youtuber Data Interview Pro

https://www.youtube.com/watch?v=hAqg2dlNeUc

This video taught me how to prepare for coding interview. and here is an outline for it:

The following roles' interviews, there will be coding interviews:

types of coding interviews:

question:

find the median from an unsorted array

find the median of "streaming" data

question:

how to prepare:

1. brush up the basics

https://www.khanacademy.org/

https://brilliant.org/

1.collect sample questions:

leetcode, glassdoor, lintcode,

2.organize questions by 1. topics, 2. by jupyter notebook

3. solve problems:

when you stuck:

find other people's approaches, google searching

come up with multiple solutions:

1). brute force

2). optimized version

4. Enhance understanding

https://www.hackerrank.com/

https://leetcode.com/

https://www.lintcode.com/

algorithmic trading using python

https://www.youtube.com/watch?v=xfzGZB4HhEE

freecodecamp: a good coding organization

Saturday, May 15, 2021

survival analysis with R

https://www.datacamp.com/community/tutorials/survival-analysis-R

Tuesday, April 20, 2021

An interesting statistics book

https://mc-stan.org/docs/2_20/stan-users-guide/zero-inflated-section.html

Wednesday, April 7, 2021

python books

编写高质量Python代码的59个有效方法

https://edu.heibai.org/Effective+Python.%E7%BC%96%E5%86%99%E9%AB%98%E8%B4%A8%E9%87%8FPython%E4%BB%A3%E7%A0%81%E7%9A%8459%E4%B8%AA%E6%9C%89%E6%95%88%E6%96%B9%E6%B3%95.Brett+Slatkin.pdf

编写高质量代码改善 Python 程序的 91 个建议

https://l1nwatch.gitbook.io/writing_solid_python_code_gitbook/

Cox regression and sas phreg

https://www.statisticshowto.com/cox-regression-model/

https://web.stanford.edu/~kcobb/stats210/lab10.pdf

Friday, March 26, 2021

zero-inflated models in r

https://cran.r-project.org/web/packages/ZIM/ZIM.pdf

log transformation , purpose and inteperation

https://medium.com/@kyawsawhtoon/log-transformation-purpose-and-interpretation-9444b4b049c9

Before we get into log transformation, let’s quickly talk about normal distribution. Normal distribution is a probability and statistical concept widely used in scientific studies for its many benefits. Just to name a few of these benefits— normal distribution is simple. Its mean, median and mode have the same value and it can be defined with just two parameters: mean and variance. It also has important mathematical implications such as the Central Limit Theorem.

Unfortunately, our real-life datasets do not always follow the normal distribution. They are often so skewed making the results of our statistical analyses invalid. That’s where Log Transformation comes in.

When our original continuous data do not follow the bell curve, we can log transform this data to make it as “normal” as possible so that the statistical analysis results from this data become more valid.

poisson distribution

https://statmodeling.stat.columbia.edu/2019/08/21/you-should-usually-log-transform-your-positive-data/

Monday, March 15, 2021

Friday, March 12, 2021

R for categorical data analysis

https://people.umass.edu/biep640w/pdf/R_for_Categorical_Data_Analysis.pdf

Wednesday, March 10, 2021

statistics subject statistical computing

statistical computing

generate random variables

statistics subject statistical consulting skills

statistical consulting skills

Statistical Consulting

Statistical consulting for dissertations is our sole focus. It is required for many fields and offers extensive potentials to people in need of it. Conducting constructive analysis and research on certain topics is highly relevant because of the competition that exists in the contemporary world today. Statistical consulting is therefore a necessary tool for obtaining the required and significant data in many fields and domains.

Statistical consulting is necessary in the following areas:

· Science and Medicine
· Business and Commerce
· Social Sciences like Psychology and Sociology
· Government Bodies and Law
· Universities and Colleges for dissertations and theses

Statistical consulting is very popular and is applied in almost every aspect of society because it ensures adequate and successful functioning of organizations. The activities that are associated with statistical consulting ranges, and can concern any topic. The task of consulting varies from project to project and involves the statistician acting as the problem solver by conducting selecting the appropriate analysis, conducting analyses on the data, and interpreting the findings. In statistical consulting, the consultant also acts as a guide and advisor to the client.

Consulting is very effective and accurate and is therefore a necessary entity in today’s day and age. A statistician should possess certain qualities that ensure his success. For a statistician, statistical consulting requires the following characteristics:

· Good Communication Skills: The statistician must possess good communication skills so that the consultant can interact with the client fluently and comfortably. Once the idea is made clear to the consultant (through healthy, professional conversations with the client) the statistical consultant is able to carry on with their work professionally as per the clients needs.

· Scientific Interest: It requires a keen and eager interest in the pursuits of science. Science forms the core root of statistics and is a fundamental feature in statistical consulting.

· Statistical Knowledge: Without proper training and education in statistics, one cannot engage in statistical consulting. One has to be able to understand the subject and to apply the required technical and specialized techniques and procedures of statistics.

· Computer Proficiency: Basic computer skills are essential. The statistician must be able to utilize the computer while making use of the new and latest statistical software available in the market today.

Statistical consulting necessitates that the statistician perform research studies and experiments. It also includes designing the experiments needed for observations and interpretations. With statistical consulting at hand, organizations need not worry themselves with the problem of obtaining the needed information.

Statistical consulting is instrumental to small scale industries in particular. Small scale industries can gain profits through statistical consulting as the statistician gives the industry the opportunity to conduct proper researches as well as giving them a full length statistical analysis. Without this, the company would not have the resources or knowledge to carry on with the project.

Contemporary times offer a number of possibilities to people. The advent of statistics and statistical consulting has in many ways made things a lot easier for everyone. Statistical consulting has brought with it an endless number of solutions for research findings and data analysis. Information is an important need and statistics have various ways of finding that information so that it may be utilized to bring about advancement and evolution. Clearly, this consulting is of crucial importance today.

statistics subject

nonparametric statistics

ranking data

normality testing methods

Shapiro-Wilk test.
Kolmogorov-Smirnov test.
Anderson-Darling test

compare sample means

Mann-Whitney U Test.
Wilcoxon Signed-Rank Test.
Kruskal-Wallis H Test.
Friedman Test.

statistics subject

design and linear modeling

liear regression models

time series models

Friday, March 5, 2021

statistical procedures

https://www.ncss.com/software/ncss/procedures/

Sample size calculation and power analysis

sample size calculation and power analysis from toward data science

https://towardsdatascience.com/experiment-sample-size-calculation-using-power-analysis-81cb1bc5f74b

Thursday, March 4, 2021

sample size and power analysis

http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf

https://www.uncg.edu/mat/qms/PSS%20Workshop%20Document%202018.pdf

Wednesday, March 3, 2021

A interesting youtube channel for survival analysis, and an interesting webpage for survival analysis

youtube channel for survival analysis

https://www.youtube.com/c/marinstatlectures/videos

a webpage for survival analysis

https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html

Tuesday, March 2, 2021

Survival analysis , a comprehensive paper

https://towardsdatascience.com/practical-survival-analysis-concepts-techniques-models-10819fd49b4a

Sunday, February 28, 2021

Cox Proportional hazards model, what is it for?

In suivival analysis, S(t)=p(T>t) denote the probablity an event survived time t, called survival probablity.

Kaplan-Meier survival estimate, it is a non-parametric method to estitmate the survival probability based on the survival times.

Log-Rank test , after KM, it is not easy to do the test still. Then we use log-rank test to compare several survival curves.

hazard probability.

cummulative hazard

from survival equation S(t), we can get cummulative hazard.

Cox Proportional hazards model

One R example

covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss") univ_formulas <- sapply(covariates, function(x) as.formula(paste('Surv(time, status)~', x))) univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)}) # Extract data univ_results <- lapply(univ_models, function(x){ x <- summary(x) p.value<-signif(x$wald["pvalue"], digits=2) wald.test<-signif(x$wald["test"], digits=2) beta<-signif(x$coef[1], digits=2);#coeficient beta HR <-signif(x$coef[2], digits=2);#exp(beta) HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2) HR.confint.upper <- signif(x$conf.int[,"upper .95"],2) HR <- paste0(HR, " (", HR.confint.lower, "-", HR.confint.upper, ")") res<-c(beta, HR, wald.test, p.value) names(res)<-c("beta", "HR (95% CI for HR)", "wald.test", "p.value") return(res) #return(exp(cbind(coef(x),confint(x)))) }) res <- t(as.data.frame(univ_results, check.names = FALSE)) as.data.frame(res)

beta HR (95% CI for HR) wald.test p.value
age       0.019            1 (1-1)       4.1   0.042
sex       -0.53   0.59 (0.42-0.82)        10  0.0015
ph.karno -0.016      0.98 (0.97-1)       7.9   0.005
ph.ecog    0.48        1.6 (1.3-2)        18 2.7e-05
wt.loss  0.0013         1 (0.99-1)      0.05    0.83

The output above shows the regression beta coefficients, the effect sizes (given as hazard ratios) and statistical significance for each of the variables in relation to overall survival. Each factor is assessed through separate univariate Cox regressions.

From the output above,
The variables sex, age and ph.ecog have highly statistically significant coefficients, while the coefficient for ph.karno is not significant.
age and ph.ecog have positive beta coefficients, while sex has a negative coefficient. Thus, older age and higher ph.ecog are associated with poorer survival, whereas being female (sex=2) is associated with better survival.

Cox model, they want to estimate if a treatment is useful?

ezoic