
Thursday, November 18, 2021

pixel 3 vs iphone



1. Some apps are available on the Pixel but not on iOS.

2. The Pixel 3 has a better Wi-Fi connection than the iPhone.

If you frequently use your usb on your computer, your computer will more easily freeze up



Sunday, August 1, 2021

logistic regression, a guide

 


https://medium.com/analytics-vidhya/a-comprehensive-guide-to-logistic-regression-e0cf04fe738c


decision boundary = cutoff = threshold

time series analysis, lags and autocorrelations

 

https://www.business-science.io/timeseries-analysis/2017/08/30/tidy-timeseries-analysis-pt-4.html

guide to time series analysis

 


https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775

bidding on click through, cost per impressions, conversion

On Facebook, should you bid on click-through, cost per impression, or conversion?


How do you choose among the bid prices?

choose the cut off for binary classification

 

http://ethen8181.github.io/machine-learning/unbalanced/unbalanced.html#choosing-the-suitable-cutoff-value



Why is the cutoff (threshold) 0.5 most of the time?


https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_logistic_roc_curves.htm




In the built-in function, the default cutoff is 0.5.


When reporting the confusion matrix, you can change the cutoff to an arbitrary number, and the confusion matrix changes accordingly. Changing the cutoff also changes the false positive rate and the false negative rate.

dat <- iris
dat$positive <- as.factor(ifelse(dat$Species == "setosa", "s", "ns"))
library(caret)
mod <- train(positive~Sepal.Length, data=dat, method="glm")




confusionMatrix(table(predict(mod, type="prob")[,"s"] >= 0.25,
                      dat$positive == "s"))
# Confusion Matrix and Statistics
# 
#        
#         FALSE TRUE
#   FALSE    88    3
#   TRUE     12   47
#                                           
#                Accuracy : 0.9             
#                  95% CI : (0.8404, 0.9429)
#     No Information Rate : 0.6667          
#     P-Value [Acc > NIR] : 2.439e-11       
#                                           
#                   Kappa : 0.7847          
#  Mcnemar's Test P-Value : 0.03887         
#                                           
#             Sensitivity : 0.8800          
#             Specificity : 0.9400          
#          Pos Pred Value : 0.9670          
#          Neg Pred Value : 0.7966          
#              Prevalence : 0.6667          
#          Detection Rate : 0.5867          
#    Detection Prevalence : 0.6067          
#       Balanced Accuracy : 0.9100        
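
The effect of moving the cutoff can also be sketched in plain Python. This is only an illustration: the probabilities and labels below are made up, and `confusion_at_cutoff` and `rates` are hypothetical helper names, not part of caret or any library.

```python
def confusion_at_cutoff(probs, labels, cutoff):
    """Count TP, FP, TN, FN when predicting positive for prob >= cutoff."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_positive = p >= cutoff
        if predicted_positive and y:
            tp += 1
        elif predicted_positive and not y:
            fp += 1
        elif not predicted_positive and y:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


def rates(cm):
    """Return (false positive rate, false negative rate) from the counts."""
    fpr = cm["FP"] / (cm["FP"] + cm["TN"])
    fnr = cm["FN"] / (cm["FN"] + cm["TP"])
    return fpr, fnr


# Made-up predicted probabilities and true labels (illustrative only).
probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.85, 0.95]
labels = [False, False, False, True, False, False, True, True, True, True]

for cutoff in (0.25, 0.5, 0.75):
    cm = confusion_at_cutoff(probs, labels, cutoff)
    print(cutoff, cm, rates(cm))
```

With this toy data, a low cutoff produces more false positives and a high cutoff produces more false negatives, which is the trade-off described above.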

Saturday, July 31, 2021

an interesting website for data analysis

 

https://www.digitaling.com/


It is in Chinese.

internet industry technical terms

advertising related  


1. ad network


2. ad exchange

3. RTB real time bidding

4. DSP demand side platform

5. DMP data-management platform

6. programmatic buying

7. Private market place

8. Programmatic Direct buy

9. Premium Inventory

10. Remnant Inventory

11. CPM cost per mille

12. CPM (Cost Per Thousand; Cost Per Thousand Impressions)

13. CPC (Cost Per Click;Cost Per Thousand Click-Through)

14. CPA(Cost-per-Action)

15. CPS(Cost-Per-Sale)

16. CPT cost per time

17. CPV cost per visit

18. CPI cost per install

19. CPD cost per download

20. banner

21. Interstitial

22. Native Advertising (Native Ads)

Operation related


23. AARRR: Acquisition, Activation, Retention, Revenue, Referral

24. DNU(Daily New Users)

25. CAC(Customer Acquisition Cost)

26. CPC (Cost Per Customer )

27. CR (Conversion Rate)

28. DAU(Daily Active Users)

29. WAU(Weekly Active Users)

30. MAU(Monthly Active Users)

31. DEC(Daily Engagement Count)

32. DAOT/AT(Daily Avg.Online Time)

33. DAU (Daily Active User)

34. MAU (Monthly active users)

35. User Retention

36. Day 1/3/7/30 Retention Ratio

37. Users Churn

38. Day 1 Churn Ratio

39. Day 7 Churn Ratio

40. Day 30 Churn Ratio

41. MPR(Monthly Payment Ratio)

42.  MAU, APA

43. APA(Active Payment Account)

44. ARPU (Average Revenue per User)

45. ARPU

46. monthly ARPU = monthly revenue / MAU

47.ARPPU(Average Revenue per Paying User)

48. ARPPU

monthly ARPPU = monthly revenue / APA

49. life time

50 life time value

51 PCU(Peak Concurrent Users)

52. ACU(Average Concurrent Users)

53. New Users Conversion Rate

54. SEO (Search Engine Optimization)

55. SEM (Search Engine Marketing)

56. ASO (App Store Optimization)

57. KPI(Key performance indicators)

58. GMV (Gross Merchandise Volume)

59. SKU (Stock Keeping Unit)

60. Long Tail Keyword

61. MVP(Minimum Viable Product )

62. SP (Service Provider)

63. CP (Content Provider)

64. BD (Business Development)

65. SDK (Software Development Kit)

66. UE/UX (User Experience)

67. EDM (Email Direct Marketing)

68. SNS (Social Networking Services)

69. UGC (User Generated Content)

70. PGC(Professional Generated Content)

71. OGC(Occupationally-generated Content)

72. KOL(Key Opinion Leader)
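
The monthly ARPU and ARPPU items in the list above can be sketched as a small calculation. The definitions used here are the commonly quoted ones and are an assumption on my part (ARPU = revenue / MAU, ARPPU = revenue / APA, MPR = APA / MAU). The function name and the numbers are made up for illustration.

```python
def monthly_metrics(revenue, mau, apa):
    """Compute monthly ARPU, ARPPU and MPR from monthly totals.

    revenue: total revenue in the month
    mau: monthly active users
    apa: active payment accounts (paying users) in the month
    """
    return {
        "ARPU": revenue / mau,   # average revenue per active user
        "ARPPU": revenue / apa,  # average revenue per paying user
        "MPR": apa / mau,        # share of active users who paid
    }


# Made-up example month: 50,000 in revenue, 10,000 active users, 500 payers.
metrics = monthly_metrics(revenue=50_000.0, mau=10_000, apa=500)
print(metrics)
```

ARPPU is always at least as large as ARPU, because paying users are a subset of active users.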


Tuesday, July 13, 2021

statistics knowledge websites

 


https://stattrek.com/


https://www.khanacademy.org/math/statistics-probability


https://brilliant.org/


https://www.datacamp.com/community/tutorials

statistical knowledge

 1. how to normalize data

2. how to detect outlier, what is IQR?

3. how to reverse a list in python

4. how to insert a number in a list in python

5. how does spark's rdd work? how is it different from pyspark's dataframe?

6. how to calculate cumulative sums in a table in sql

7. what is the difference between mapreduce and in-memory?

8. what is mapreduce?

9. what is lag?

10. preceding and following in sql?

11. how to count number of data points in a numpy array?

12. how to do a hypothesis test?

13. what is false positive rate? what is false negative rate? 

14. how to delete duplicates in a dataframe in python?

15. what is false discovery rate? and bonferroni correction?
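
For question 2 in the list above, a minimal sketch of one common answer: the IQR (interquartile range) is Q3 minus Q1, and a frequently used rule flags points more than 1.5 × IQR outside the quartiles. The 1.5 multiplier and the `inclusive` quantile method are conventions assumed here, and `iqr_outliers` is a hypothetical helper name.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


# Made-up sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(sample))
```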


Monday, July 12, 2021

what is false discovery rate



https://www.youtube.com/watch?v=3PVkfQRUGI4


An interesting video about it.

The false discovery rate (FDR) is the expected proportion of false positives (type I errors) among all the hypotheses we reject. In multiple testing, we fix in advance the FDR level we want to control.



a concise video about bonferroni correction

 https://www.youtube.com/watch?v=HLzS5wPqWR0


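To understand the Bonferroni correction, we first need to understand the family-wise error rate (FWER): the probability of making at least one type I error across all the tests.

a1 = type I error rate of a single test


FWER = 1-(1-a1)^m   (for independent tests)


m is the number of tests performed.


Bonferroni correction:

corrected a1 = a1/m


FWER = 1-(1-a1/m)^m, which is at most a1.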
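
A quick numeric check of the formulas above, as a sketch that assumes the tests are independent. Here `alpha` plays the role of a1, and the choice of 20 tests at 0.05 is only an example.

```python
def fwer(alpha, m):
    """Family-wise error rate for m independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** m


alpha = 0.05  # per-test type I error rate (a1 in the notes above)
m = 20        # number of tests (example value)

uncorrected = fwer(alpha, m)    # every test run at alpha
corrected = fwer(alpha / m, m)  # every test run at the Bonferroni level alpha/m

print(f"uncorrected FWER: {uncorrected:.3f}")
print(f"Bonferroni FWER:  {corrected:.3f}")
```

Without correction, the chance of at least one false positive across 20 tests is well above one half. With the correction it stays just below the original 0.05.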








Wednesday, June 23, 2021

how to prepare for coding interviews from youtuber Data Interview Pro

https://www.youtube.com/watch?v=hAqg2dlNeUc

This video taught me how to prepare for coding interviews. Here is an outline:

Interviews for the following roles include a coding round:











types of coding interviews:




question:

find the median from an unsorted array

find the median of  "streaming" data
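
A minimal sketch of one way to answer both median questions, not necessarily the approach from the video: sort for the unsorted array, and keep two heaps for the streaming case. `median_unsorted` and `RunningMedian` are hypothetical names.

```python
import heapq


def median_unsorted(values):
    """Median of an unsorted, non-empty list via sorting, O(n log n)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


class RunningMedian:
    """Median of a stream using two heaps, O(log n) per insert."""

    def __init__(self):
        self._low = []   # max-heap (stored negated): the smaller half
        self._high = []  # min-heap: the larger half

    def add(self, x):
        # Push into the smaller half, then move its largest value up,
        # so every value in _low is <= every value in _high.
        heapq.heappush(self._low, -x)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so _low has the same size as _high or one more.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        # Assumes at least one value has been added.
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2


stream = RunningMedian()
running = []
for value in [5, 2, 8, 1, 9]:
    stream.add(value)
    running.append(stream.median())
print(running)
```

The sort-based version is the brute-force answer. The two-heap version avoids re-sorting each time a new value arrives.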







question:









question:


how to prepare:

1. brush up the basics







1. collect sample questions:

leetcode, glassdoor, lintcode

2. organize questions: 1. by topic, 2. in a jupyter notebook

3. solve problems:

when you are stuck:

look at other people's approaches, search Google
come up with multiple solutions:

1). brute force
2). optimized version

4. Enhance understanding






























algorithmic trading using python

 https://www.youtube.com/watch?v=xfzGZB4HhEE



freeCodeCamp: a good organization for learning to code

Friday, March 26, 2021

zero-inflated models in r

 


https://cran.r-project.org/web/packages/ZIM/ZIM.pdf

log transformation, purpose and interpretation

 

https://medium.com/@kyawsawhtoon/log-transformation-purpose-and-interpretation-9444b4b049c9

Before we get into log transformation, let’s quickly talk about normal distribution. Normal distribution is a probability and statistical concept widely used in scientific studies for its many benefits. Just to name a few of these benefits— normal distribution is simple. Its mean, median and mode have the same value and it can be defined with just two parameters: mean and variance. It also has important mathematical implications such as the Central Limit Theorem.


Unfortunately, our real-life datasets do not always follow the normal distribution. They are often so skewed that the results of our statistical analyses become invalid. That’s where Log Transformation comes in.


When our original continuous data do not follow the bell curve, we can log transform this data to make it as “normal” as possible so that the statistical analysis results from this data become more valid.
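
The effect described above can be checked on simulated data. This sketch assumes right-skewed, strictly positive data drawn from a log-normal distribution, where the log transform yields a normal sample. The seed, sample size, and the `skewness` helper are choices made for illustration.

```python
import math
import random


def skewness(xs):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the run is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]  # right-skewed
logged = [math.log(x) for x in raw]                         # log-transformed

print("skewness before log:", round(skewness(raw), 2))
print("skewness after log: ", round(skewness(logged), 2))
```

The raw sample has a long right tail and a large positive skewness, while the logged sample has skewness close to zero. Real data will rarely be exactly log-normal, so the improvement is usually less clean than in this simulation.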


poisson distribution




https://statmodeling.stat.columbia.edu/2019/08/21/you-should-usually-log-transform-your-positive-data/


Wednesday, March 10, 2021

statistics subject statistical computing

 statistical computing 

generate random variables



statistics subject statistical consulting skills


statistical consulting skills 


Statistical Consulting


Statistical consulting for dissertations is our sole focus.  It is required for many fields and offers extensive potentials to people in need of it.  Conducting constructive analysis and research on certain topics is highly relevant because of the competition that exists in the contemporary world today.  Statistical consulting is therefore a necessary tool for obtaining the required and significant data in many fields and domains.

Statistical consulting is necessary in the following areas:

· Science and Medicine
· Business and Commerce
· Social Sciences like Psychology and Sociology
· Government Bodies and Law
· Universities and Colleges for dissertations and theses

Statistical consulting is very popular and is applied in almost every aspect of society because it ensures adequate and successful functioning of organizations.  The activities associated with statistical consulting vary widely and can concern any topic.  The task of consulting varies from project to project and involves the statistician acting as the problem solver by selecting the appropriate analysis, conducting analyses on the data, and interpreting the findings.  In statistical consulting, the consultant also acts as a guide and advisor to the client.

Consulting is very effective and accurate and is therefore a necessary entity in today’s day and age.  A statistician should possess certain qualities that ensure his success.  For a statistician, statistical consulting requires the following characteristics:

· Good Communication Skills: The statistician must possess good communication skills so that the consultant can interact with the client fluently and comfortably.  Once the idea is made clear to the consultant (through healthy, professional conversations with the client) the statistical consultant is able to carry on with their work professionally as per the client's needs.

· Scientific Interest: It requires a keen and eager interest in the pursuits of science.  Science forms the core root of statistics and is a fundamental feature in statistical consulting.

· Statistical Knowledge: Without proper training and education in statistics, one cannot engage in statistical consulting.  One has to be able to understand the subject and to apply the required technical and specialized techniques and procedures of statistics.

· Computer Proficiency: Basic computer skills are essential.  The statistician must be able to utilize the computer while making use of the new and latest statistical software available in the market today.

Statistical consulting necessitates that the statistician perform research studies and experiments.  It also includes designing the experiments needed for observations and interpretations.  With statistical consulting at hand, organizations need not worry themselves with the problem of obtaining the needed information.

Statistical consulting is instrumental to small scale industries in particular.  Small scale industries can gain profits through statistical consulting, as the statistician gives the industry the opportunity to conduct proper research and provides a full-length statistical analysis.  Without this, the company would not have the resources or knowledge to carry on with the project.

Contemporary times offer a number of possibilities to people.  The advent of statistics and statistical consulting has in many ways made things a lot easier for everyone.  Statistical consulting has brought with it an endless number of solutions for research findings and data analysis. Information is an important need and statistics have various ways of finding that information so that it may be utilized to bring about advancement and evolution. Clearly, this consulting is of crucial importance today.


statistics subject


nonparametric statistics

ranking data 


normality testing methods

  • Shapiro-Wilk test.
  • Kolmogorov-Smirnov test.
  • Anderson-Darling test

compare sample means

  • Mann-Whitney U Test.
  • Wilcoxon Signed-Rank Test.
  • Kruskal-Wallis H Test.
  • Friedman Test.

statistics subject


design and linear modeling 

linear regression models







time series models





Sunday, February 28, 2021

Cox Proportional hazards model, what is it for?

In survival analysis, S(t) = P(T > t) denotes the probability that a subject survives beyond time t. It is called the survival probability.

The Kaplan-Meier survival estimate is a non-parametric method for estimating the survival probability from the observed survival times.
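
A minimal sketch of the Kaplan-Meier product-limit estimate in plain Python, assuming right-censored data. `kaplan_meier` is a hypothetical helper and the six observations are made up. Real analyses would normally use a library such as R's survival package.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    times: observed time for each subject
    events: 1 if the event occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects that share the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by the fraction of at-risk subjects surviving time t.
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        # Events and censored subjects both leave the risk set after t.
        n_at_risk -= leaving
    return curve


# Made-up data: 6 subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(t, round(s, 4))
```

Censored subjects never lower the curve themselves. They only shrink the number at risk for later event times.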

Log-rank test: the Kaplan-Meier curves alone do not give a formal test of whether groups differ, so we use the log-rank test to compare two or more survival curves.

hazard probability. 


cumulative hazard

From the survival function S(t), we can get the cumulative hazard.
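
For reference, the standard relationship between the survival function and the cumulative hazard, written with H(t) for the cumulative hazard and h(t) for the hazard rate (these two symbols are my notation, not the post's):

```latex
H(t) = -\log S(t)
\qquad\Longleftrightarrow\qquad
S(t) = \exp\bigl(-H(t)\bigr),
\qquad\text{where}\qquad
H(t) = \int_0^t h(u)\,du .
```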


Cox Proportional hazards model

One R example

library(survival)  # provides Surv(), coxph() and the lung dataset

covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss")
# One univariate formula per covariate, e.g. Surv(time, status) ~ age
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste('Surv(time, status)~', x)))

# Fit a separate Cox model for each formula
univ_models <- lapply(univ_formulas, function(x){coxph(x, data = lung)})
# Extract data
univ_results <- lapply(univ_models,
                       function(x){
                          x <- summary(x)
                          p.value <- signif(x$wald["pvalue"], digits=2)
                          wald.test <- signif(x$wald["test"], digits=2)
                          beta <- signif(x$coef[1], digits=2)  # coefficient beta
                          HR <- signif(x$coef[2], digits=2)    # exp(beta)
                          HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
                          HR.confint.upper <- signif(x$conf.int[,"upper .95"], 2)
                          HR <- paste0(HR, " (",
                                       HR.confint.lower, "-", HR.confint.upper, ")")
                          res <- c(beta, HR, wald.test, p.value)
                          names(res) <- c("beta", "HR (95% CI for HR)", "wald.test",
                                          "p.value")
                          return(res)
                          #return(exp(cbind(coef(x),confint(x))))
                        })
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)


           beta HR (95% CI for HR) wald.test p.value
age       0.019            1 (1-1)       4.1   0.042
sex       -0.53   0.59 (0.42-0.82)        10  0.0015
ph.karno -0.016      0.98 (0.97-1)       7.9   0.005
ph.ecog    0.48        1.6 (1.3-2)        18 2.7e-05
wt.loss  0.0013         1 (0.99-1)      0.05    0.83

The output above shows the regression beta coefficients, the effect sizes (given as hazard ratios) and statistical significance for each of the variables in relation to overall survival. Each factor is assessed through separate univariate Cox regressions.


From the output above,

  • The variables sex, ph.karno and ph.ecog have highly statistically significant coefficients (p < 0.01), and age is significant at the 0.05 level (p = 0.042), while the coefficient for wt.loss is not significant (p = 0.83).

  • age and ph.ecog have positive beta coefficients, while sex has a negative coefficient. Thus, older age and higher ph.ecog are associated with poorer survival, whereas being female (sex=2) is associated with better survival.
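
The link between the beta column and the HR column in the output above is HR = exp(beta). A small check, using the rounded beta values copied from the table (so the results only match to the displayed precision):

```python
import math

# Rounded beta coefficients copied from the output table above.
betas = {
    "age": 0.019,
    "sex": -0.53,
    "ph.karno": -0.016,
    "ph.ecog": 0.48,
    "wt.loss": 0.0013,
}

# A hazard ratio above 1 means higher hazard (poorer survival) per unit increase.
hazard_ratios = {name: math.exp(beta) for name, beta in betas.items()}

for name, hr in hazard_ratios.items():
    direction = "higher hazard" if hr > 1 else "lower hazard"
    print(f"{name}: HR = {hr:.2f} ({direction})")
```

A negative beta, as for sex, gives a hazard ratio below 1, which is why it reads as better survival.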




Is the Cox model used to estimate whether a treatment is useful?


looking for a man

 I am a middle-aged woman, born in 1980. I do not have any kids and have no complicated dating history. I am looking for a man here for marriage...