Videos uploaded by user “PyData”
Eric J. Ma - An Attempt At Demystifying Bayesian Deep Learning
 
36:15
PyData New York City 2017 Slides: https://ericmjl.github.io/bayesian-deep-learning-demystified/ In this talk, I aim to do two things: demystify deep learning as essentially matrix multiplications with weights learned by gradient descent, and demystify Bayesian deep learning as placing priors on weights. I will then provide PyMC3 and Theano code to illustrate how to construct Bayesian deep nets and visualize uncertainty in their results.
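As a hedged sketch of the approach described (not the speaker's own code), here is a one-hidden-layer Bayesian neural network in PyMC3 with priors placed on the weights; the shapes, priors, and toy data are illustrative assumptions:

import numpy as np
import pymc3 as pm

X = np.random.randn(100, 2)                  # toy inputs
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # toy binary labels

with pm.Model() as bnn:
    # Priors on the weights are what make the net "Bayesian"
    w_in = pm.Normal("w_in", mu=0, sigma=1, shape=(2, 5))
    w_out = pm.Normal("w_out", mu=0, sigma=1, shape=5)
    hidden = pm.math.tanh(pm.math.dot(X, w_in))
    p = pm.math.sigmoid(pm.math.dot(hidden, w_out))
    pm.Bernoulli("obs", p=p, observed=y)
    trace = pm.sample(1000)                  # posterior over weights = uncertainty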
Views: 18025 PyData
Steve Dower: What's coming in Python 3.5 and why you should be excited
 
40:46
PyData Seattle 2015 Overview of the newest additions to Python 3.5, being released later this year. Python 3.5, the latest installment of the language and library, is just around the corner (https://www.python.org/dev/peps/pep-0478/), though you can try out the beta now (https://www.python.org/downloads/release/python-350b2/). This session will cover some of the new syntax and library additions that should have people excited to start using it. As a teaser (come to the session for all the details), we'll look at better asynchronous programming, simpler mathematics, easier installation, better package management, formalized type hints, flexible function calls, and more!
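For flavor, a short sketch of a few of the 3.5-era additions mentioned (type hints, async/await, and the @ matrix-multiplication operator); the examples are illustrative, not from the talk:

import asyncio
import numpy as np

def scale(vec: list, factor: float) -> list:   # PEP 484 type hints
    return [x * factor for x in vec]

async def fetch() -> str:                      # PEP 492 async/await syntax
    await asyncio.sleep(0.1)
    return "done"

a = np.eye(2) @ np.array([[1.0], [2.0]])       # PEP 465 matrix multiplication
print(scale([1, 2], 2.0), asyncio.run(fetch()), a.ravel())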
Views: 17448 PyData
Anna Nicanorova: Optimizing Life Everyday Problems Solved with Linear Programming in Python
 
16:27
PyData NYC 2015 Linear optimization can be a very powerful tool for mathematical decision-making under constraints. This tutorial shows how to build a linear program optimizer in Python. To make the format more entertaining, the tutorial problems are designed to tackle relevant day-to-day questions: how to optimize your vacation, see all the art around a museum, and create optimal reading lists. Linear optimization is a well-established area of operations research, famous for solving investment and transportation problems. Linear Programming and Integer Programming can describe problems where decisions are subject to constraints and one seeks to maximize or minimize an objective (basically everyday life), so it has always surprised me that more people don't use LP for solving their real-life problems. LP/IP can also sometimes replace very complex algorithms for optimizing under constraints. This is a tutorial on using LP modeling frameworks in Python (PuLP and SciPy), with relevant examples of optimizing everyday life. It is amazing that, by properly translating a problem into algebraic expressions, we can solve such relevant everyday problems as how many (and which) bestsellers to read in a year, which vacations to take while keeping costs minimal, and how to cover all the museums in NYC. Slides available here: https://github.com/AnnaNican/optimizers
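As a hedged sketch of the kind of model the tutorial builds (the data and variable names are invented for illustration), here is a tiny PuLP program that picks museums to visit under a time budget:

from pulp import LpProblem, LpMaximize, LpVariable, lpSum

hours = {"MoMA": 3, "Met": 4, "Guggenheim": 2}    # time each visit takes
rating = {"MoMA": 9, "Met": 10, "Guggenheim": 7}  # how much we want to go

prob = LpProblem("museum_day", LpMaximize)
visit = {m: LpVariable(f"visit_{m}", cat="Binary") for m in hours}
prob += lpSum(rating[m] * visit[m] for m in hours)         # objective
prob += lpSum(hours[m] * visit[m] for m in hours) <= 6     # time budget
prob.solve()
print({m: visit[m].value() for m in hours})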
Views: 24189 PyData
Robert Meyer - Analysing user comments with Doc2Vec and Machine Learning classification
 
34:56
Description I used the Doc2Vec framework to analyze user comments on German online news articles and uncovered some interesting relations among the data. Furthermore, I fed the resulting Doc2Vec document embeddings as inputs to a supervised machine learning classifier. Can we determine for a particular user comment from which news site it originated? Abstract Doc2Vec is a nice neural network framework for text analysis. The machine learning technique computes so-called document and word embeddings, i.e. vector representations of documents and words. These representations can be used to uncover semantic relations. For instance, Doc2Vec may learn that the word "King" is similar to "Queen" but less so to "Database". I used the Doc2Vec framework to analyze user comments on German online news articles and uncovered some interesting relations among the data. Furthermore, I fed the resulting Doc2Vec document embeddings as inputs to a supervised machine learning classifier. Accordingly, given a particular comment, can we determine from which news site it originated? Are there patterns among user comments? Can we identify stereotypical comments for different news sites? Besides presenting the results of my experiments, I will give a short introduction to Doc2Vec. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
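A minimal gensim sketch of the pipeline described, assuming gensim 4.x; the toy comments and labels are illustrative, not the speaker's data:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

comments = [("great article", "site_a"), ("totally disagree", "site_b")]
docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, (text, _) in enumerate(comments)]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

X = [model.dv[i] for i in range(len(comments))]  # document embeddings
y = [site for _, site in comments]
clf = LogisticRegression().fit(X, y)             # which site did it come from?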
Views: 18699 PyData
Ryan Henderson - One in a billion: finding matching images in very large corpora
 
14:54
PyData Berlin 2016 The goal was not only to support high write volumes of over 10k/s but also to support fast lookup of similar images around 1-2s for over 1B images. Though similar paid services and free image hashing libraries exist, this may be the first complete free open-source solution. Available at: https://github.com/ascribe/image-match image-match started as an internal project. We needed a way, given some target image, to find similar images downloaded by our web-crawler (think Tineye). So not only did we need to support fast, accurate lookup for millions or even billions of images, we also needed to facilitate very high volume insertion -- around 10k images per second. In my talk, I will cover: - The Problem: why is finding similar images hard? - Algorithm: based on this paper - Performance: but does it scale? - Alternatives
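The project's README shows a compact core API: generate an image signature, then compare signatures with a normalized distance (the file names below are placeholders):

from image_match.goldberg import ImageSignature

gis = ImageSignature()
a = gis.generate_signature("MonaLisa.jpg")
b = gis.generate_signature("MonaLisa_crop.jpg")
print(gis.normalized_distance(a, b))  # small distance suggests a match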
Views: 6832 PyData
David Beazley | Keynote: Built in Super Heroes
 
42:14
PyData Chicago 2016
Views: 23235 PyData
Matt Davis: A Practical Introduction to Airflow | PyData SF 2016
 
45:52
Matt Davis: A Practical Introduction to Airflow PyData SF 2016 Airflow is a pipeline orchestration tool for Python that allows users to configure multi-system workflows that are executed in parallel across workers. I’ll cover the basics of Airflow so you can start your Airflow journey on the right foot. This talk aims to answer questions such as: What is Airflow useful for? How do I get started? What do I need to know that’s not in the docs? Airflow is a popular pipeline orchestration tool for Python that allows users to configure complex (or simple!) multi-system workflows that are executed in parallel across any number of workers. A single pipeline might contain bash, Python, and SQL operations. With dependencies specified between tasks, Airflow knows which ones it can run in parallel and which ones must run after others. Airflow is written in Python and users can add their own operators with custom functionality, doing anything Python can do. Moving data through transformations and from one place to another is a big part of data science/engineering, but there are only two widely-used orchestration systems for doing so that are written in Python: Luigi and Airflow. We’ve been using Airflow (http://pythonhosted.org/airflow/) for several months at Clover Health and have learned a lot about its strengths and weaknesses. We use it to run several pipelines multiple times per day. One includes over 450 heavily linked tasks! www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
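A minimal sketch of an Airflow DAG of the kind described, using Airflow 2.x-style imports; the tasks and schedule are illustrative assumptions:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG("example_pipeline", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = PythonOperator(task_id="transform",
                               python_callable=lambda: print("transforming"))
    extract >> transform  # Airflow runs transform only after extract succeeds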
Views: 54976 PyData
Peter Prettenhofer - Gradient Boosted Regression Trees in scikit-learn
 
38:01
http://www.slideshare.net/PyData/gradient-boosted-regression-trees-in-scikit-learn-gilles-louppe This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, or the Heritage Health Prize. I will give a brief introduction to the GBRT model and regression trees -- focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning and model interpretation that should significantly improve your score on Kaggle.
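A short scikit-learn sketch of the workflow covered, including the regularization knobs (learning_rate, max_depth, subsample) the talk emphasizes; the data is synthetic:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, subsample=0.8, random_state=0)
gbrt.fit(X_tr, y_tr)
print("R^2:", gbrt.score(X_te, y_te))
print("importances:", gbrt.feature_importances_)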
Views: 33424 PyData
The Python ecosystem for Data Science: A guided tour - Christian Staudt
 
25:41
Description Pythonistas have access to an extensive collection of tools for data analysis. The space of tools is best understood as an ecosystem: Libraries build upon each other, and a good library fills an ecological niche by doing certain jobs well. This is a guided tour of the Python data science ecosystem, aiming to help us select the right stack for our next data-driven project. Abstract Python is on its way to becoming the lingua franca of data science, and Pythonistas have access to an impressive and extensive collection of tools for data analysis. Here, a data scientist needs to see the forest for the trees: The space of tools is best understood as an ecosystem, where libraries build upon each other, and where a good library fills an ecological niche by doing certain jobs well. This talk is a guided tour of the Python data science ecosystem. More than a list of libraries, it aims to provide some structure, classifying tools by type of data, size of data, and type of analysis. In our tour, we visit a number of areas, including working with tabular data (numpy, pandas, dask, ...) and graph data (e.g. networkx), statistics (e.g. statsmodels), machine learning (scikit-learn, ...), and data visualization (matplotlib, seaborn, bokeh, ...). Aspiring data scientists, and everyone else working with data, should find this useful for selecting the right tools for their next data-driven project. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 23473 PyData
What is PyData?
 
02:08
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. View Upcoming Events: https://pydata.org/events/ Find a Meetup Near You: http://meetup.com/pro/pydata Learn More: www.pydata.org https://numfocus.org/programs/pydata PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 11063 PyData
Andrew Montalenti: Beating Python's GIL to Max Out Your CPUs
 
41:41
PyData NYC 2015 The #1 complaint about Python in a data analysis context is the presence of the Global Interpreter Lock, or GIL. At its core, it means that a given Python program cannot easily utilize more than one core of a multi-core machine to do computation in parallel. However, fear not! To beat the GIL, you just need to be willing to adopt a little magic -- and this talk will tell you how. Beating Python's Global Interpreter Lock starts with a recognition of a searing reality: that no matter how many multi-core machines exist, most CPU-heavy computation tasks will max out even the cores available on a given large box. Once you come to terms with this fact, you realize what you actually want isn't multi-core computation, but multi-core / multi-node computation. That is, cluster-scale computing. To illustrate multi-core vs multi-node, we'll contrast Python's standard library concurrent.futures module with the IPython.parallel framework. The former allows you to go multi-core to beat the GIL, with some caveats. But the latter lets you go multi-node. We'll then explore what makes multi-node computation difficult, and illustrate it with a small Python program that reads a fast-moving data stream and processes it in parallel, using pykafka and Apache Kafka to provide the data stream. Finally, we'll explore the open source frameworks that have finally "defeated" the cluster computing challenge for Python. These are Apache Storm and Apache Spark. They each have different designs -- and different Python integration options -- but their architectures are fascinating. The good news is, as of 2015, each of these frameworks has a high-quality, production-ready Python API, including one written by the presenter and his team! You'll leave this talk with the satisfaction that whether you need to use 2 cores, 8, 32, or even 10,000 cores across hundreds of machines, you'll have a technology available and the understanding necessary to make it happen. Never let being CPU-bound be a bottleneck for your next great data exploration or scientific computing challenge! Attend this talk to beat Python's GIL not with a CPython fork, not with a PyPy STM implementation, but instead with old-fashioned distributed computation! Slides available here: http://www.slideshare.net/pixelmonkey/beating-pythons-gil-to-max-out-your-cpus
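As a sketch of the first step the talk describes, here is the standard-library route past the GIL: concurrent.futures' ProcessPoolExecutor runs CPU-bound work in separate processes, each with its own interpreter and GIL (the workload is illustrative):

import math
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_heavy, [10**6] * 8))  # spread over cores
    print(results[:2])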
Views: 7810 PyData
Carol Willing | JupyterHub: A "things explainer overview"
 
19:12
PyData Carolinas 2016 With JupyterHub you can create a multi-user Hub which spawns, manages, and proxies multiple instances of the single-user Jupyter notebook (IPython notebook) server. JupyterHub provides single-user notebook servers to many users. For example, JupyterHub could serve notebooks to a class of students, a corporate workgroup, or a science research group.
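A minimal jupyterhub_config.py sketch of such a deployment; the values are illustrative assumptions, and JupyterHub injects the c config object when it loads the file:

c.JupyterHub.bind_url = "http://:8000"            # where the Hub listens
c.Spawner.default_url = "/lab"                    # open JupyterLab by default
c.Authenticator.allowed_users = {"alice", "bob"}  # who may log in
c.Authenticator.admin_users = {"alice"}           # who manages the Hub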
Views: 4278 PyData
Mike Starr - Dataswarm
 
40:21
PyData SV 2014 At Facebook, data is used to gain insights for existing products and drive development of new products. In order to do this, engineers and analysts need to seamlessly process data across a variety of backend data stores. Dataswarm is a framework for writing data processing pipelines in Python. Using an extensible library of operations (e.g. executing queries, moving data, running scripts), developers programmatically define dependency graphs of tasks to be executed. Dataswarm takes care of the rest: distributed execution, scheduling, and dependency management. The talk will cover high-level design, example pipeline code, and plans for the future.
Views: 7894 PyData
Marc Garcia - Towards Pandas 1.0
 
34:24
PyData London Meetup #47 Tuesday, August 14, 2018 It's been 10 years since pandas development started. In this time, pandas' growth in popularity has been incredible, becoming the de-facto standard for data analysis, to the point of being responsible for 1% of StackOverflow traffic. And the development of pandas continues strong, with new features and many fixes coming with every release. In this talk we'll start with the motivation for the project, covering its past, the current development and latest features, and what we can expect from the project in the future. Marc (https://twitter.com/datapythonista) is a pandas core developer and Python fellow. His academic background is in AI and finance. He has worked with Python for the last 12 years, and in the data field for the last 5. He's the organiser of the London Python sprints group, and a regular speaker at PyData and PyCon conferences. Sponsored & Hosted by Man AHL **** www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 8037 PyData
Brian Lange | It's Not Magic: Explaining Classification Algorithms
 
42:45
PyData Chicago 2016 As organizations increasingly make use of data and machine learning methods, people must build a basic "data literacy". Data scientist & instructor Brian Lange provides simple, visual & equation-free explanations for a variety of classification algorithms, geared towards helping people understand them. He shows how the concepts explained can be pulled off using the Python library scikit-learn in a few lines.
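The "few lines" point can be illustrated with scikit-learn's uniform estimator interface; the models and data here are arbitrary choices:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for clf in (KNeighborsClassifier(), SVC(), DecisionTreeClassifier()):
    print(type(clf).__name__, clf.fit(X_tr, y_tr).score(X_te, y_te))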
Views: 10176 PyData
Rob Story | Data Engineering Architecture at Simple
 
34:07
PyData Chicago 2016 A walk through Simple's Data Engineering stack, including lessons learned and why we chose certain tools and languages for different parts of our infrastructure.
Views: 14112 PyData
David Higgins - Introduction to Julia for Python Developers
 
33:52
PyData Berlin 2016 Julia is a performance oriented language written from the ground-up to support numerical processing and parallelisation. The basic syntax of Julia resembles a cross between Matlab and Python, but offers performance which is comparable to compiled C code. I will present an overview of the language with particular emphasis on where Python users may benefit from using it in their daily work. Python users have long benefitted from the less verbose nature of Python when compared with C and Fortran. However, Python was originally designed for scripting tasks, using dynamic types and widescale object orientation, neither of which is necessarily beneficial when it comes to numerical computing. Thus, we have seen the widespread use of Python libraries for numerical computation (scipy, numpy, etc.). Julia is a new language, developed at MIT, which attempts to learn from the experience of development of Python and similar languages. The main goals are to provide a non-verbose, performance oriented language written from the ground-up to support numerical processing and parallelisation. In its most basic syntax Julia resembles a cross between Matlab and Python, but via compilation through an intermediate level representation (LLVM) it offers performance which is comparable to compiled C code. I am not going to argue that Julia is ready for primetime yet. However, it is definitely worth consideration by anyone currently resorting to Cython or needing distributed access to large datasets. I will present an outline/introduction to the language, including the main benefits and current weaknesses. Of particular interest to the audience may be the fact that Python libraries are importable and callable from within Julia, allowing a continuity of existing workflow but from a Julia-based host environment. My main focus will be a numerically literate audience who are already contending with the technical limitations of Python and are curious about the new language in town. Slides: https://github.com/daveh19/pydataberlin2016
Views: 12067 PyData
Thomas Wiecki - Probabilistic Programming Data Science with PyMC3
 
39:15
PyData London 2016 Probabilistic programming is a new paradigm that greatly increases the number of people who can successfully build statistical models and machine learning algorithms, and makes experts radically more effective. This talk will provide an overview of PyMC3, a new probabilistic programming package for Python featuring intuitive syntax and next-generation sampling algorithms. Machine learning is the driving force behind many recent revolutions in data science. Comprehensive libraries provide the data scientist with many turnkey algorithms that have very weak assumptions on the actual distribution of the data being modeled. While this blackbox property makes machine learning algorithms applicable to a wide range of problems, it also limits the amount of insight that can be gained by applying them. The field of statistics, on the other hand, often approaches problems individually and hand-tailors statistical models to specific problems. To perform inference on these models, however, is often mathematically very challenging, and thus requires time spent deriving equations as well as simplifying assumptions (like the normality assumption) to make inference mathematically tractable. Probabilistic programming is a new programming paradigm that provides the best of both worlds and revolutionizes the field of machine learning. Recent methodological advances in sampling algorithms like Markov Chain Monte Carlo (MCMC), as well as huge increases in processing power, allow for almost complete automation of the inference process. Probabilistic programming thus greatly increases the number of people who can successfully build statistical models and machine learning algorithms, and makes experts radically more effective. Data scientists can create complex generative Bayesian models tailored to the structure of the data and specific problem at hand, but without the burden of mathematical tractability or limitations due to mathematical simplifications. This talk will provide an overview of PyMC3, a new probabilistic programming package for Python featuring intuitive syntax and next-generation sampling algorithms. ---- PyData is a gathering of users and developers of data analysis tools in Python. The goals are to provide Python enthusiasts a place to share ideas and learn from each other about how best to apply our language and tools to ever-evolving challenges in the vast realm of data management, processing, analytics, and visualization. We aim to be an accessible, community-driven conference, with tutorials for novices, advanced topical workshops for practitioners, and opportunities for package developers and users to meet in person. www.pydata.org Notebook: https://gist.github.com/anonymous/9287a213fe188a79d7d7774eef79ad4d Slides: https://docs.google.com/presentation/d/1QNxSjDHJbFL7vFwQHHheeGmBHEJAo39j28xdObFY6Eo/edit Twitter: https://twitter.com/twiecki
Views: 13184 PyData
Vincent Warmerdam: Winning with Simple, even Linear, Models | PyData London 2018
 
43:32
PyData London 2018 Simple models work. Linear models work. No need for deep learning or complex ensembles, you can often keep it simple. In this talk I'll discuss and demonstrate some winning tricks that you can apply on simple, even linear models. Slides: http://koaning.io/theme/notebooks/simple-models.pdf --- www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 16519 PyData
Tetiana Ivanova: How to become a Data Scientist in 6 months | PyData London 2016
 
56:25
Tetiana Ivanova: How to become a Data Scientist in 6 Months, a Hacker's Approach to Career Planning PyData London 2016 This talk outlines my journey from complete novice to machine learning practitioner. It started in November 2015 when I left my job as a project manager, and by April 2016 I was hired as a Data Scientist by a startup developing bleeding edge deep learning algorithms for medical imagery processing. SHORT INTRO Who I am, my background and a short summary of my story. Here I will list the steps I personally took to achieve the goal I had. HOW DID I DO IT? Why I chose a “hacky” way to enter this career path. The first mover advantage, why getting a degree doesn’t always improve your career prospects. Possibly a rant on the signaling function of formal education and how that is rarely aligned with a relevant practical skill set. Some stats to back it up (best career success predictors). Examples of hacking bureaucracies/social hierarchies from my experience and elsewhere. List of things not to do and common cognitive pitfalls. Networking for nerds - how to do it right. Time management for chronic procrastinators - how to plan a self-guided project. Some notes on psychology of time discounting and need for external reinforcement, with autobiographical examples. CONCLUSION You don’t need a PhD or even a masters to do machine learning. On taking calculated risks and especially calculated exits from one’s comfort zone. Some notes on soul searching and how to choose a career that is also a passion. Reading list. Slides available here: https://www.slideshare.net/TetianaIvanova2/how-to-become-a-data-scientist-in-6-months www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 210142 PyData
James Blackburn - Python and MongoDB as a Platform for Financial Market Data
 
34:13
PyData London 2014 As businesses search for diversification by trading new financial products, it is easy for market data infrastructure to become fragmented and inconsistent. We describe how we have successfully used Python, Pandas and MongoDB to build a market data system that stores a variety of Timeseries-based financial data for research and live trading at a large systematic hedge fund. Our system has a simple, high-performance schema, a consistent API for all data access, and built-in support for data versioning and deduplication. We support fast interactive access to the data by quants, as well as clustered batch processing by running a dynamic data flow graph on a cluster.
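A generic pymongo sketch of the pattern described (versioned timeseries documents read back into pandas); the schema and names are illustrative assumptions, not the speakers' system:

import pandas as pd
from pymongo import MongoClient

coll = MongoClient("localhost", 27017).marketdata.timeseries

df = pd.DataFrame({"price": [100.0, 100.5]},
                  index=pd.date_range("2014-01-01", periods=2, name="ts"))
records = df.reset_index().to_dict("records")
coll.insert_one({"symbol": "ABC", "version": 1, "data": records})

doc = coll.find_one({"symbol": "ABC"}, sort=[("version", -1)])  # latest version
print(pd.DataFrame(doc["data"]).set_index("ts"))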
Views: 19957 PyData
Dave Nielsen: Top 5 uses of Redis as a Database | PyData Seattle 2015
 
36:08
Dave Nielsen: Top 5 uses of Redis as a Database PyData Seattle 2015 Sponsor Talk - Redis www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 88118 PyData
JupyterLab: The Evolution of the Jupyter Notebook - Ian Rose, Grant Nestor
 
39:49
PyData LA 2018 We introduce JupyterLab, the next-generation UI developed by the Project Jupyter team, and its emerging ecosystem of extensions. JupyterLab is the next-generation Jupyter Notebook, providing a set of core building blocks for interactive computing (e.g. notebook, terminal, file browser, console) and well-designed interfaces for them that allow users to combine them in novel ways. --- www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 7903 PyData
James Powell: Design Principles | PyData DC 2016
 
41:35
PyData DC 2016
Views: 22058 PyData
Andrew Rowan - Bayesian Deep Learning with Edward (and a trick using Dropout)
 
39:20
Filmed at PyData London 2017 Description Bayesian neural networks have seen a resurgence of interest as a way of generating model uncertainty estimates. I use Edward, a new probabilistic programming framework extending Python and TensorFlow, for inference on deep neural nets for several benchmark data sets. This is compared with dropout training, which has recently been shown to be formally equivalent to approximate Bayesian inference. Abstract Deep learning methods represent the state-of-the-art for many applications such as speech recognition, computer vision and natural language processing. Conventional approaches generate point estimates of deep neural network weights and hence make predictions that can be overconfident since they do not account well for uncertainty in model parameters. However, having some means of quantifying the uncertainty of our predictions is often a critical requirement in fields such as medicine, engineering and finance. One natural response is to consider Bayesian methods, which offer a principled way of estimating predictive uncertainty while also showing robustness to overfitting. Bayesian neural networks have a long history. Exact Bayesian inference on network weights is generally intractable and much work in the 1990s focused on variational and Monte Carlo based approximations [1-3]. However, these suffered from a lack of scalability for modern applications. Recently the field has seen a resurgence of interest, with the aim of constructing practical, scalable techniques for approximate Bayesian inference on more complex models, deep architectures and larger data sets [4-10]. Edward is a new, Turing-complete probabilistic programming language built on Python [11]. Probabilistic programming frameworks typically face a trade-off between the range of models that can be expressed and the efficiency of inference engines. Edward can leverage graph frameworks such as TensorFlow to enable fast distributed training, parallelism, vectorisation, and GPU support, while also allowing composition of both models and inference methods for a greater degree of flexibility. In this talk I will give a brief overview of developments in Bayesian deep learning and demonstrate results of Bayesian inference on deep architectures implemented in Edward for a range of publicly available data sets. Dropout is an empirical technique which has been very successfully applied to reduce overfitting in deep learning models [12]. Recent work by Gal and Ghahramani [13] has demonstrated a surprising formal equivalence between dropout and approximate Bayesian inference in neural networks. I will compare some results of inference via the machinery of Edward with model averaging over neural nets with dropout training. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. We aim to be an accessible, community-driven conference, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
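The dropout trick from Gal and Ghahramani that the talk compares against can be sketched in a few lines of Keras (not Edward): keep dropout active at prediction time and treat repeated stochastic forward passes as samples from an approximate posterior predictive; the architecture and data are illustrative assumptions:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.randn(256, 8).astype("float32")
y = x.sum(axis=1, keepdims=True)
model.fit(x, y, epochs=5, verbose=0)

# training=True keeps dropout on; the spread across passes estimates uncertainty
samples = np.stack([model(x[:5], training=True).numpy() for _ in range(100)])
print("mean:", samples.mean(axis=0).ravel(), "std:", samples.std(axis=0).ravel())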
Views: 21441 PyData
Jeff Reback - What is the Future of Pandas
 
31:13
PyData New York City 2017 Slides: https://www.slideshare.net/JeffReback/future-of-pandas-82901487 The history and architectural decisions behind the open-source pandas project, and present plans for the future direction of the project.
Views: 10255 PyData
PyData Tel Aviv Meetup: Node2vec - Elior Cohen
 
21:10
PyData Tel Aviv Meetup #17 7 November 2018 Sponsored and hosted by SimilarWeb https://www.meetup.com/PyData-Tel-Aviv/ www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 2080 PyData
Jeffrey Yau: Time Series Forecasting using Statistical and Machine Learning Models | PyData NYC 2017
 
32:03
PyData New York City 2017 Time series data is ubiquitous, and time series modeling techniques are data scientists' essential tools. This presentation compares the Vector Autoregressive (VAR) model, one of the most important classes of multivariate time series statistical models, with neural network-based techniques, which have received a lot of attention in the data science community in the past few years.
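On the statistical side, a minimal VAR fit-and-forecast with statsmodels looks like this (the simulated series stand in for real data):

import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
y = np.zeros((201, 2))
for t in range(1, 201):
    y[t] = 0.5 * y[t - 1] + rng.standard_normal(2)  # simple VAR(1) dynamics
data = pd.DataFrame(y[1:], columns=["series_a", "series_b"])

results = VAR(data).fit(maxlags=5, ic="aic")        # lag order chosen by AIC
print(results.forecast(data.values[-results.k_ar:], steps=3))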
Views: 35375 PyData
Brian Kent: Density Based Clustering in Python
 
39:24
PyData NYC 2015 Clustering data into similar groups is a fundamental task in data science. Probability density-based clustering has several advantages over popular parametric methods like K-Means, but practical usage of density-based methods has lagged for computational reasons. I will discuss recent algorithmic advances that are making density-based clustering practical for larger datasets. Clustering data into similar groups is a fundamental task in data science applications such as exploratory data analysis, market segmentation, and outlier detection. Density-based clustering methods are based on the intuition that clusters are regions where many data points lie near each other, surrounded by regions without much data. Density-based methods typically have several important advantages over popular model-based methods like K-Means: they do not require users to know the number of clusters in advance, they recover clusters with more flexible shapes, and they automatically detect outliers. On the other hand, density-based clustering tends to be more computationally expensive than parametric methods, so density-based methods have not seen the same level of adoption by data scientists. Recent computational advances are changing this picture. I will talk about two density-based methods and how new Python implementations are making them more useful for larger datasets. DBSCAN is by far the most popular density-based clustering method. A new implementation in Dato's GraphLab Create machine learning package dramatically speeds up DBSCAN computation by taking advantage of GraphLab Create's multi-threaded architecture and using an algorithm based on the connected components of a similarity graph. The density Level Set Tree is a method first proposed theoretically by Chaudhuri and Dasgupta in 2010 as a way to represent a probability density function hierarchically, enabling users to use all density levels simultaneously, rather than choosing a specific level as with DBSCAN. The Python package DeBaCl implements a modification of this method and a tool for interactively visualizing the cluster hierarchy. Slides available here: https://speakerdeck.com/papayawarrior/density-based-clustering-in-python Notebooks: http://nbviewer.ipython.org/github/papayawarrior/public_talks/blob/master/pydata_nyc_dbscan.ipynb http://nbviewer.ipython.org/github/papayawarrior/public_talks/blob/master/pydata_nyc_DeBaCl.ipynb
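The talk demonstrates DBSCAN in GraphLab Create; as a generic sketch of the same idea, scikit-learn's implementation shows the key properties (no preset cluster count, flexible shapes, built-in outlier label):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
# label -1 marks outliers; the number of clusters is discovered, not specified
print("clusters found:", len(set(labels) - {-1}))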
Views: 16672 PyData
Dr Jessica Stauth: Portfolio and Risk Analytics in Python with pyfolio | PyData NYC 2015
 
36:22
Dr Jessica Stauth: Portfolio and Risk Analytics in Python with pyfolio PyData NYC 2015 Pyfolio is a recent open source library developed by Quantopian to support common financial analyses and plots of portfolio allocations over time. At its core is a tear sheet consisting of various individual plots that provide a comprehensive image of the performance of a trading algorithm, plus advanced statistical analyses using Bayesian modeling (http://quantopian.github.io/pyfolio/). Python is quickly establishing itself as the lingua franca for quantitative finance. The rich stack of open source tools like Pandas, the Jupyter notebook, and Seaborn provides quants with a rich and powerful tool belt to analyze financial data. While useful for quantitative finance, these general purpose libraries lack support for common financial analyses like the computation of certain risk factors (Sharpe, Fama-French), or plots of portfolio allocations over time. Pyfolio is a recent open source tool developed by Quantopian to fill this gap. At the core of pyfolio is a so-called tear sheet that consists of various individual plots that provide a comprehensive image of the performance of a trading algorithm/portfolio. In addition, the library features advanced statistical analyses using Bayesian modeling. The software can be used stand-alone or with our open-source backtesting library Zipline, and is available on the Quantopian platform. This talk will be a tutorial on how to get the most out of this library (http://quantopian.github.io/pyfolio/). Slides available here: http://www.slideshare.net/JessStauth/pydata-nyc-2015 Relevant GitHub repos: https://github.com/quantopian/pyfolio https://github.com/quantopian/zipline www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
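A minimal pyfolio sketch: a returns tear sheet built from a daily returns series (the synthetic returns are an illustrative stand-in for a real strategy):

import numpy as np
import pandas as pd
import pyfolio as pf

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0005, 0.01, 252),
                    index=pd.bdate_range("2015-01-02", periods=252))
pf.create_returns_tear_sheet(returns)  # renders the plots of the tear sheet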
Views: 38446 PyData
Implementing and Training Predictive Customer Lifetime Value Models in Python
 
36:26
Implementing and Training Predictive Customer Lifetime Value Models in Python by Jean-Rene Gauthier, Ben Van Dyke Customer lifetime value models (CLVs) are powerful predictive models that allow analysts and data scientists to forecast how much customers are worth to a business. CLV models provide crucial inputs to inform marketing acquisition decisions, retention measures, customer care queuing, demand forecasting, etc. They are used and applied in a variety of verticals, including retail, gaming, and telecom. This tutorial is separated into two parts: In the first part, we will provide a brief overview of the ins and outs of probabilistic models, which can be used to quantify the future value of a customer, and demonstrate how e-commerce companies are using the outputs of these models to identify, retain, and target high-value customers. In the second part, we will implement, train, and validate predictive customer lifetime value models in a hands-on Python tutorial. Throughout the tutorial, we will use a real-world retail dataset and go over all the steps necessary to build a reliable customer lifetime value model: data exploration, feature engineering, model implementation, training, and validation. We will also use some of the probabilistic programming language packages available in Python (e.g. Stan, PyMC) to train these models. The resulting Python notebooks will lay out the foundation for more advanced models tailored to the specifics of each business setting. Throughout the tutorial, we will give the audience additional tips on how to tweak the models to fit different business settings. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
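The tutorial builds its models with probabilistic programming packages; as a compact illustration of the same probabilistic "buy 'til you die" family, the lifetimes package fits a BG/NBD model in a few lines, using its bundled CDNOW sample:

from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

data = load_cdnow_summary(index_col=[0])
bgf = BetaGeoFitter(penalizer_coef=0.01)
bgf.fit(data["frequency"], data["recency"], data["T"])

# expected number of purchases per customer over the next 30 periods
pred = bgf.conditional_expected_number_of_purchases_up_to_time(
    30, data["frequency"], data["recency"], data["T"])
print(pred.head())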
Views: 15207 PyData
Tom Bocklisch - Conversational AI: Building clever chatbots
 
29:56
Description Most chatbots and voice skills are based on a state machine and too many if/else statements. Tom will show you how to move past that and build flexible, robust experiences using machine learning throughout the stack. Abstract Conversational software is everywhere: messaging apps have opened up APIs to bot developers and millions of consumers now own voice controlled speakers. But the tools and frameworks for building these systems are still immature. Tom will talk about Rasa, an open source machine learning framework for building conversational software. The talk will cover the algorithms Rasa uses to build flexible and robust voice and text systems, the trade-offs in using supervised versus reinforcement learning, and whether it's really such a good idea to generate text with LSTMs. Outline: components (NLU, DM, integration, NLG); overview of available tools and frameworks; how Rasa does NLU; motivation, and a chatbot leading to state-machine hell; how Rasa does dialogue management; how to advance a bot's capabilities by closing the loop on data collection; current research topics and challenges. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 14700 PyData
James Powell: So you want to be a Python expert? | PyData Seattle 2017
 
01:54:11
www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 221619 PyData
Andrew Knight | Testing is Fun in Python!
 
40:41
PyData Carolinas 2016 Testing software is just as important in Python as it is in any other programming language. Rather than treat testing as a “necessary evil,” Python offers a number of versatile test frameworks to make it fun and easy. This talk will cover basic testing best practices and introduce a few of the popular frameworks, including unittest, doctest, py.test, Nose, and Avocado. Testing is vital to the success of any software, including big data and analytics code. Unfortunately, it is often regarded as a “necessary evil” – extra work that slows down progress. In this session, I will highlight how testing in Python can be fun, easy, fast, and helpful. First, I will give a brief overview of basic best practices for testing. We will talk about the difference between debugging and testing, different types of tests, how to write good test cases, and basic testing fixtures like assertions and results. I will focus on unit testing, but the concepts can be applied to higher levels of testing as well. Then, for the majority of the session, I will introduce different Python test frameworks: - unittest as the standard module for unit test classes. - doctest as a lightweight way to write short, self-documenting assertions in docstrings. - py.test as a way to write very concise test cases. - Nose as an extension of unittest with added features. - Avocado as a comprehensive framework with parameters, replay, and test discovery. This talk is designed to be useful to Python programmers of any skill level. Only a basic understanding of Python is required.
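A tiny sketch contrasting two of the frameworks covered: unittest's class-based style versus pytest's bare asserts (the function under test is arbitrary):

import unittest

def mean(xs):
    return sum(xs) / len(xs)

class TestMean(unittest.TestCase):      # unittest: explicit assertion methods
    def test_mean(self):
        self.assertEqual(mean([1, 2, 3]), 2)

def test_mean_pytest():                 # pytest: a plain assert is enough
    assert mean([1, 2, 3]) == 2

if __name__ == "__main__":
    unittest.main()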
Views: 17458 PyData
Renee Teate | Becoming a Data Scientist Advice From My Podcast Guests
 
44:01
PyData DC 2016 Overwhelmed by the vast resources (of varying quality) available online for learning data science? In this talk, I compile resources from data scientists on twitter, advice from guests of my podcast, and some of my own experience to help get you started on the path to Becoming a Data Scientist. The options for learning data science online are vast and overwhelming, but it is possible to find great resources that work well for you and learn data science without going back to school if you know how to approach it. On my "Becoming a Data Scientist" podcast, I have interviewed 17 data scientists (or those on the way to becoming data scientists) about their career paths and how they learned data science. I also interact with hundreds of data scientists regularly on Twitter. In this talk, I compile the frequent advice and the best resources, and give my answers to some common questions about how to become a data scientist.
Views: 13984 PyData
Can one do better than XGBoost? - Mateusz Susik
 
23:47
Can one do better than XGBoost? Presenting 2 new gradient boosting libraries - LightGBM and Catboost Mateusz Susik Description We will present two recent challengers to the XGBoost library: LightGBM (released October 2016) and CatBoost (open-sourced July 2017). Participants will learn the theoretical and practical differences between these libraries. Finally, we will describe how we use gradient boosting libraries at McKinsey & Company. Abstract Gradient boosting has proved to be a very effective method for classification and regression in recent years. A lot of successful business applications and data science contest solutions were developed around the XGBoost library. It seemed that XGBoost would dominate the field for many years. Recently, two major players have released their own implementations of the algorithm. The first - LightGBM - comes from Microsoft. Its major advantages are lower memory usage and faster training speed. The second - CatBoost - was implemented by Yandex. Here, the approach was different. The aim of the library was to improve on the state-of-the-art gradient boosting algorithms in terms of accuracy. During the talk, the participants will learn about the differences in the algorithm designs, APIs and performance. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
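Both libraries expose scikit-learn-style wrappers, so a side-by-side sketch on toy data is short (parameters are illustrative, not tuned):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lgbm = lgb.LGBMClassifier(n_estimators=200).fit(X_tr, y_tr)
cat = CatBoostClassifier(iterations=200, verbose=False).fit(X_tr, y_tr)
print("LightGBM:", lgbm.score(X_te, y_te), "CatBoost:", cat.score(X_te, y_te))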
Views: 16160 PyData
Maciej Kula - Hybrid Recommender Systems in Python
 
34:41
PyData Amsterdam 2016 Systems based on collaborative filtering are the workhorse of recommender systems. They yield great results when abundant data is available. Unfortunately, their performance suffers when encountering new items or new users. In this talk, I'm going to talk about hybrid approaches that alleviate this problem, and introduce a mature, high-performance Python recommender package called LightFM. Introduction to collaborative filtering. Works well when data is abundant (MovieLens, Amazon), but poorly when new users and items are common. Introduce hybrid approaches: metadata embeddings. This is implemented in LightFM. LightFM has a couple of tricks up its sleeve: multicore training, training with superior ranking losses. Slides available here: https://speakerdeck.com/maciejkula/hybrid-recommender-systems-at-pydata-amsterdam-2016
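A minimal LightFM sketch on its bundled MovieLens sample, trained with the WARP ranking loss mentioned above; the hybrid variant would additionally pass item metadata via the item_features argument:

import numpy as np
from lightfm import LightFM
from lightfm.datasets import fetch_movielens

data = fetch_movielens(min_rating=4.0)     # bundled MovieLens 100k sample
model = LightFM(loss="warp", no_components=30)
model.fit(data["train"], epochs=10, num_threads=2)  # multicore training

scores = model.predict(0, np.arange(10))   # score the first 10 items for user 0
print(data["item_labels"][np.argsort(-scores)])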
Views: 11474 PyData
Bhargav Srinivasa Desikan - Topic Modelling (and more) with NLP framework Gensim
 
48:26
Description https://github.com/bhargavvader/personal/tree/master/notebooks/text_analysis_tutorial This tutorial will guide you through the process of analysing your textual data through topic modelling - from finding and cleaning your data, pre-processing using spaCy, and applying topic modelling algorithms using gensim - before moving on to more advanced textual analysis techniques. Abstract Topic modelling is a great way to analyse completely unstructured textual data - and with the Python NLP framework Gensim, it's very, very easy to do this. The purpose of this tutorial is to guide one through the whole process of topic modelling - right from pre-processing your raw textual data, creating your topic models, and evaluating the topic models, to visualising them. Advanced topic modelling techniques will also be covered in this tutorial, such as Dynamic Topic Modelling, Topic Coherence, Document Word Coloring, and LSI/HDP. The Python packages used during the tutorial will be spaCy (for pre-processing), gensim (for topic modelling), and pyLDAvis (for visualisation). The interface for the tutorial will be a Jupyter notebook. The takeaway from the tutorial would be the participants' ability to get their hands dirty analysing their own textual data, through the entire lifecycle from cleaning raw data to visualising topics. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
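A minimal gensim sketch of the core loop the tutorial walks through (tokenised toy documents stand in for real, spaCy-pre-processed text):

from gensim import corpora
from gensim.models import LdaModel

docs = [["human", "machine", "interface"],
        ["graph", "trees", "minors"],
        ["machine", "learning", "graph"]]
dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)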
Views: 13150 PyData
Thomas Huijskens - Bayesian optimisation with scikit-learn
 
39:21
Filmed at PyData London 2017 In this talk, Thomas Huijskens shows how Bayesian optimisation can be used to tune the hyperparameters of machine learning models, building the optimisation loop on top of scikit-learn. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. We aim to be an accessible, community-driven conference, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
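As a hedged sketch of the technique in the title (using scikit-optimize's gp_minimize rather than the speaker's own code), here is Bayesian optimisation of an SVM hyperparameter, where a Gaussian-process surrogate decides which point to evaluate next:

from skopt import gp_minimize
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(params):
    (log_C,) = params
    return -cross_val_score(SVC(C=10 ** log_C), X, y, cv=3).mean()

result = gp_minimize(objective, [(-3.0, 3.0)], n_calls=15, random_state=0)
print("best C:", 10 ** result.x[0], "best CV accuracy:", -result.fun)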
Views: 12279 PyData
Brian Granger, Chris Colbert & Ian Rose - JupyterLab+Real Time Collaboration
 
29:25
www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 16936 PyData
Joel Grus: Learning Data Science Using Functional Python
 
44:32
PyData Seattle 2015 Everyone has an opinion on the best way to learn data science. Some people start with statistics or machine learning theory, some use R, and some use libraries like scikit-learn. I'll use several examples to contrast these with a simpler approach using functional programming techniques in Python. In addition, I'll show how even advanced data scientists can benefit from thinking more functionally. Materials available here: Github: https://github.com/joelgrus/stupid-itertools-tricks-pydata Slides: https://docs.google.com/presentation/d/1eI60SL3UxtWfr9ktrv48-pcIkk4S7JiDmeXGCyyGhCs/edit#slide=id.p
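In the talk's spirit, a small functional sketch: gradient descent written as an iterated function over a lazy infinite stream (the objective is a toy):

from functools import reduce
from itertools import islice

def iterate(f, x):
    """Lazy infinite stream x, f(x), f(f(x)), ..."""
    while True:
        yield x
        x = f(x)

step = lambda x: x - 0.1 * 2 * (x - 3)        # gradient step on (x - 3)**2
estimates = islice(iterate(step, 0.0), 50)
print(reduce(lambda _, x: x, estimates))      # last estimate, close to 3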
Views: 29758 PyData
Natalie Hockham: Machine learning with imbalanced data sets
 
27:45
Classification algorithms tend to perform poorly when data is skewed towards one class, as is often the case when tackling real-world problems such as fraud detection or medical diagnosis. A range of methods exist for addressing this problem, including re-sampling, one-class learning and cost-sensitive learning. This talk looks at these different approaches in the context of fraud detection. Full details — http://london.pydata.org/schedule/presentation/40/
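As an illustration of one of the approaches mentioned, cost-sensitive learning, here is a sketch using scikit-learn (not the talk's code): a plain fit versus one with class_weight="balanced" on a synthetic 95:5 problem, compared on minority-class F1.

    # Cost-sensitive learning sketch on an imbalanced synthetic data set.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Roughly 95:5 class balance - fraud detection in miniature.
    X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for w in (None, "balanced"):  # plain fit vs cost-sensitive fit
        clf = LogisticRegression(class_weight=w).fit(X_tr, y_tr)
        print(w, round(f1_score(y_te, clf.predict(X_te)), 3))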
Views: 20260 PyData
Leave-One-Feature-Out Importance - Rafah El-Khatib
 
05:35
PyData London Meetup #53 Tuesday, February 5, 2019 LOFO (leave-one-feature-out) importance scores a set of features for a chosen model and evaluation metric: each feature is removed from the set in turn, the model is re-evaluated with cross-validation, and the resulting change in the metric is taken as that feature's importance. Sponsored & Hosted by Man AHL **** www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
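A from-scratch sketch of that procedure, for illustration only (the lofo-importance package has its own API; the model, metric, and dataset below are arbitrary choices):

    # Leave-one-feature-out importance, written out by hand.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)

    baseline = cross_val_score(model, X, y, scoring="roc_auc").mean()
    for i in range(X.shape[1]):
        X_wo = np.delete(X, i, axis=1)  # drop feature i from the set
        score = cross_val_score(model, X_wo, y, scoring="roc_auc").mean()
        # A positive value means the model got worse without the feature.
        print(f"feature {i}: importance = {baseline - score:+.4f}")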
Views: 1235 PyData
Building new NLP solutions with spaCy and Prodigy - Matthew Honnibal
 
40:09
PyData Berlin 2018 In this talk, I will discuss how to address some of the most likely causes of failure for new Natural Language Processing (NLP) projects. My main recommendation is to take an iterative approach: don't assume you know what your pipeline should look like, let alone your annotation schemes or model architectures. --- www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 6540 PyData
Maciej Kula | Neural Networks for Recommender Systems
 
32:56
PyData Amsterdam 2017 Neural networks are quickly becoming the tool of choice for recommender systems. In this talk, I'm going to present a number of neural network recommender models: from simple matrix factorization, through learning-to-rank, to recurrent architectures for sequential prediction. All my examples are accompanied by links to implementations to give a starting point for further experimentation. The versatility and representational power of artificial neural networks are quickly making them the preferred tool for many machine learning tasks. The same is true of recommender systems: neural networks allow us to quickly iterate over new models and to easily incorporate new user, item, and contextual features. In this talk, I'm going to present a number of useful architectures: from simple matrix factorization in neural network form, through learning-to-rank models, to more complex recurrent architectures for sequential prediction. All my examples are accompanied by links to implementations to provide a starting point for further experimentation.
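To make the first of those architectures concrete, here is a minimal PyTorch sketch of matrix factorization in neural network form (illustrative only, not the talk's code; the sizes and IDs are made up):

    # Matrix factorization as a neural network: embeddings plus a dot product.
    import torch
    import torch.nn as nn

    class MatrixFactorization(nn.Module):
        def __init__(self, n_users, n_items, dim=32):
            super().__init__()
            self.user = nn.Embedding(n_users, dim)  # latent vector per user
            self.item = nn.Embedding(n_items, dim)  # latent vector per item

        def forward(self, user_ids, item_ids):
            # Predicted score is the user-item embedding dot product.
            return (self.user(user_ids) * self.item(item_ids)).sum(dim=1)

    model = MatrixFactorization(n_users=1000, n_items=500)
    users = torch.tensor([0, 1, 2])
    items = torch.tensor([10, 20, 30])
    print(model(users, items))  # three predicted interaction scores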
Views: 10456 PyData
Stefan Otte: Deep Neural Networks with PyTorch | PyData Berlin 2018
 
01:25:59
Learn PyTorch and implement deep neural networks (and classic machine learning models). This is a hands-on tutorial geared toward people who are new to PyTorch. PyTorch is a relatively new neural network library that offers a nice tensor library, automatic differentiation for gradient descent, strong and easy GPU support, dynamic neural networks, and easy debugging. Slides: https://github.com/sotte/pytorch_tutorial --- PyData Berlin 2018 www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
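A minimal example of the kind of model and training loop the tutorial builds toward (a sketch on made-up toy data, not the tutorial's notebook code):

    # Fit a tiny network to noisy y = 3x + 1 with autograd and SGD.
    import torch
    import torch.nn as nn

    X = torch.randn(100, 1)
    y = 3 * X + 1 + 0.1 * torch.randn(100, 1)

    model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
    loss_fn = nn.MSELoss()

    for epoch in range(200):
        optimizer.zero_grad()        # clear gradients from the last step
        loss = loss_fn(model(X), y)  # forward pass
        loss.backward()              # autograd computes all gradients
        optimizer.step()             # one gradient descent update
    print(loss.item())               # should be close to the noise floor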
Views: 32795 PyData
Julie Michelman - Pandas, Pipelines, and Custom Transformers
 
34:41
Description Using pandas and scikit-learn together can be a bit clunky. For complex preprocessing, the scikit-learn Pipeline conveniently chains together transformers. But, it will convert your DataFrame to a numpy array. In this talk, we will walk through pandas DataFrames, scikit-learn preprocessing and Pipelines, and how to use custom transformers to stay in pandas land. GitHub Link: https://github.com/jem1031/pandas-pipelines-custom-transformers Abstract For data science in python, the pandas DataFrame is a common choice to store and manipulate data sets. It has named columns, each of which can contain a different data type, and an index to identify rows and assist in joining. The scikit-learn package is the major machine learning library in python. It has implementations for a wide variety of popular feature engineering, supervised, and unsupervised machine learning algorithms. Perhaps even more importantly to its success, scikit-learn provides a uniform interface for these transformers and estimators, making it easy to swap out one for another. Many scikit-learn transformers will take and return pandas DataFrames, but some only return numpy arrays. This means losing the column names and row indices. A few important examples include the meta-transformers Pipeline and FeatureUnion. The Pipeline chains together transformers to be applied in order. The FeatureUnion combines the results of transformers that can be applied in parallel. With these, the entire feature engineering process can be stored in one object and easily applied to new data sets. Luckily, scikit-learn also provides the ability to write your own custom transformers. It is as simple as defining a new class that implements the fit and transform methods. We can use this to create pandas-friendly versions of the Pipeline and FeatureUnion, as well as add transformations that are not already provided. www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
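A minimal sketch of such a custom transformer, one that standardises numeric columns while staying in pandas land (the class name and behaviour are invented for illustration, not taken from the talk's repo):

    # A pandas-friendly transformer: implements fit and transform, and
    # returns a DataFrame so column names and the index survive.
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class StandardScalerDF(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            self.means_ = X.mean()
            self.stds_ = X.std()
            return self

        def transform(self, X):
            return (X - self.means_) / self.stds_  # still a DataFrame

    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
    print(StandardScalerDF().fit_transform(df))  # TransformerMixin adds this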
Views: 10416 PyData
William Cox: An Intuitive Introduction to the Fourier Transform and FFT
 
32:57
PyData Seattle 2015 The “fast Fourier transform” (FFT) algorithm is a powerful tool for looking at time-based measurements in an interesting way, but do you understand what it does? This talk will start from basic geometry and explain what the Fourier transform is, how to understand it, why it’s useful, and show examples. If you’re collecting time-series data (e.g. heart rate, stock prices, server usage, temperature), the Fourier transform can be a useful tool for analyzing the underlying periodic nature of the data. But what is it actually doing? In this talk we’ll start from the foundation of basic geometry and explain what the transform is doing. The talk will feature lots of animated graphics to take the mystery out of this powerful method … and to keep you from reading Twitter during the talk. We’ll look at example applications and example code on how to use it in practice, along with practical tips, like choosing the number of bins and what in the world “windowing” functions are. Materials available here: https://github.com/gallamine/fft_oscon/
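A small worked example of those ideas (illustrative, not from the talk's materials): recovering a 5 Hz tone from a noisy signal with numpy's FFT, including one of those windowing functions.

    # Find the dominant frequency in a noisy signal with the FFT.
    import numpy as np

    fs = 100.0                                 # sample rate in Hz
    t = np.arange(0, 2, 1 / fs)                # two seconds of samples
    sig = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)

    window = np.hanning(t.size)                # a windowing function
    spectrum = np.abs(np.fft.rfft(sig * window))
    freqs = np.fft.rfftfreq(t.size, d=1 / fs)  # one frequency per FFT bin

    print(freqs[np.argmax(spectrum)])          # peak sits near 5.0 Hz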
Views: 23890 PyData
Michael Bronstein - Geometric deep learning on graphs: going beyond Euclidean data
 
31:37
PyData London Meetup #52 Tuesday, January 8, 2019 Sponsored & Hosted by Man AHL **** www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Views: 2133 PyData
Maria Nattestad: How Big Data is transforming biology and how we are using Python to make sense
 
39:44
PyData NYC 2015 Biology is experiencing a Big Data revolution brought on by advances in genome sequencing technologies, leading to new challenges and opportunities in computational biology. To address one of these challenges, we built a Python library named SplitThreader to represent complex genomes as graphs, which we are using to untangle hundreds of mutations in a cancer genome. The field of biology is in the midst of a sequencing revolution. The amount of data collected is growing exponentially, fueled by a cost of sequencing that is dropping at a rate outpacing Moore's Law. In Python terms, the human genome is a "list" containing 46 "strings" (chromosomes) for a total of 6 billion characters. Every single character can be the site of a mutation that brings you one step closer to cancer. My research is in cancer genomics, and I have been working to reconstruct the history of rearrangements that brought one patient's cancer genome from 46 chromosomes to 86. In an effort to untangle hundreds of large, overlapping mutations, we built a genomic graph library in Python named SplitThreader. I will motivate why a special graph library is needed to represent genomes and how this same library can be used to understand human genetic variation. I will also discuss some of the major challenges we are facing in genomics, how big data is introducing a new way of doing science, and how we ourselves have used Python to quickly iterate on new ideas and algorithms. This will serve as an overview of some of the challenges in computational biology. Slides available here: http://www.slideshare.net/MariaNattestad/data-and-python-in-biology-at-pydata-nyc-2015 GitHub repo here: https://github.com/marianattestad/splitthreader
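To make that framing concrete, here is a toy illustration in plain Python (this is not the SplitThreader API; the sequences, coordinates, and helper are invented): chromosomes as strings, with rearrangements forming a graph over breakpoints.

    # A genome as strings plus a graph of rearrangement breakpoints.
    genome = {
        "chr1": "ACGTACGTAC",  # real chromosome strings are vastly longer
        "chr2": "TTGACCGGTA",
    }

    # Each node is a (chromosome, position) breakpoint; an edge records
    # a rearrangement joining two breakpoints (toy coordinates).
    rearrangements = {
        ("chr1", 5): [("chr2", 2)],  # e.g. a translocation between them
        ("chr2", 2): [("chr1", 5)],
    }

    def neighbours(node):
        """Breakpoints reachable from this one via a single rearrangement."""
        return rearrangements.get(node, [])

    print(neighbours(("chr1", 5)))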
Views: 12753 PyData