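Title: Data Analysis with Open Source Tools
Author: Philipp K. Janert
Translator:
ISBN: 9787564126742
Publisher: Southeast University Press
Publication date: 2011-5
Format: epub/mobi/azw3/pdf
Pages: 509
Douban rating: 6.6
Book description: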
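Collecting data is relatively easy, but turning raw information into useful data requires that you know how to extract precisely what you need. Through the in-depth coverage in Data Analysis with Open Source Tools (reprint edition), intermediate and experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You will learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and how to feed your understanding back into your organization through business plans, accurate metrics reports, and other means. You will try out each concept in the hands-on workshop at the end of each chapter. Above all, you will learn how to think about the data you want to obtain, rather than relying on tools to think for you.
. Use graphics to describe data with one, two, or dozens of variables
. Develop conceptual models using back-of-the-envelope calculations as well as scaling and probability arguments
. Mine data with computationally intensive methods such as simulation and clustering
. Make your conclusions easier to understand through reports, dashboards, and other metrics programs
. Understand financial calculations, including the time value of money
. Use dimensionality reduction techniques or predictive analytics to overcome challenges in data analysis
. Become familiar with different open source programming environments for data analysis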
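About the author:
Philipp K. Janert currently provides consulting services for data analysis and mathematical modeling. He previously worked as a physicist and software engineer. He is the author of "Gnuplot in Action: Understanding Data with Graphs" (Manning) and has written articles for the O’Reilly Network, IBM developerWorks, and IEEE Software. He holds a Ph.D. in theoretical physics from the University of Washington.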
Reader comments:
@ 贝塔: Reading yet another book I will never finish
preface
1 introduction
data analysis
what’s in this book
what’s with the workshops?
what’s with the math?
what you’ll need
what’s missing
part i graphics: looking at data
2 a single variable: shape and distribution
dot and jitter plots
histograms and kernel density estimates
the cumulative distribution function
rank-order plots and lift charts
only when appropriate: summary statistics and box plots
workshop: numpy
further reading
3 two variables: establishing relationships
scatter plots
conquering noise: smoothing
logarithmic plots
banking
linear regression and all that
showing what’s important
graphical analysis and presentation graphics
workshop: matplotlib
further reading
4 time as a variable: time-series analysis
examples
the task
smoothing
don’t overlook the obvious!
the correlation function
optional: filters and convolutions
workshop: scipy.signal
further reading
5 more than two variables: graphical multivariate analysis
false-color plots
a lot at a glance: multiplots
composition problems
novel plot types
interactive explorations
workshop: tools for multivariate graphics
further reading
6 intermezzo: a data analysis session
a data analysis session
workshop: gnuplot
further reading
part ii analytics: modeling data
7 guesstimation and the back of the envelope
principles of guesstimation
how good are those numbers?
optional: a closer look at perturbation theory and error propagation
workshop: the gnu scientific library (gsl)
further reading
8 models from scaling arguments
models
arguments from scale
mean-field approximations
common time-evolution scenarios
case study: how many servers are best?
why modeling?
workshop: sage
further reading
9 arguments from probability models
the binomial distribution and bernoulli trials
the gaussian distribution and the central limit theorem
power-law distributions and non-normal statistics
other distributions
optional: case study—unique visitors over time
workshop: power-law distributions
further reading
10 what you really need to know about classical statistics
genesis
statistics defined
statistics explained
controlled experiments versus observational studies
optional: bayesian statistics—the other point of view
workshop: r
further reading
11 intermezzo: mythbusting—bigfoot, least squares, and all that
how to average averages
the standard deviation
least squares
further reading
part iii computation: mining data
12 simulations
a warm-up question
monte carlo simulations
resampling methods
workshop: discrete event simulations with simpy
further reading
13 finding clusters
what constitutes a cluster?
distance and similarity measures
clustering methods
pre- and postprocessing
other thoughts
a special case: market basket analysis
a word of warning
workshop: pycluster and the c clustering library
further reading
14 seeing the forest for the trees: finding important attributes
principal component analysis
visual techniques
kohonen maps
workshop: pca with r
further reading
15 intermezzo: when more is different
a horror story
some suggestions
what about map/reduce?
workshop: generating permutations
further reading
part iv applications: using data
16 reporting, business intelligence, and dashboards
business intelligence
corporate metrics and dashboards
data quality issues
workshop: berkeley db and sqlite
further reading
17 financial calculations and modeling
the time value of money
uncertainty in planning and opportunity costs
cost concepts and depreciation
should you care?
is this all that matters?
workshop: the newsvendor problem
further reading
18 predictive analytics
introduction
some classification terminology
algorithms for classification
the process
the secret sauce
the nature of statistical learning
workshop: two do-it-yourself classifiers
further reading
19 epilogue: facts are not reality
a programming environments for scientific computation and data analysis
software tools
a catalog of scientific software
writing your own
further reading
b results from calculus
common functions
calculus
useful tricks
notation and basic math
where to go from here
further reading
c working with data
sources for data
cleaning and conditioning
sampling
data file formats
the care and feeding of your data zoo
skills
terminology
further reading
index