About the event

This full-day workshop offers a hands-on introduction to common big data tools such as Hadoop, Spark, and Kafka, applied to batch (stored) and real-time (streaming) processing of healthcare data. The first half will introduce Google Cloud Platform (GCP) and distributed (cloud) storage systems, along with Spark libraries for machine learning classification and prediction problems. The second half will concentrate on real-time processing with Kafka, using synthetic data generation and interactive analytical dashboards. A drop-in session will be held after the workshop sessions to address technical questions. The focus of the workshop is on orchestrating big data pipelines by combining different tools and libraries, rather than on interpreting results.

Prerequisite knowledge
This workshop makes use of Linux commands and Python programming. Prior experience with either is helpful but not required: all example code and commands will be provided.

Intended audience

  • Anyone working in data science or healthcare who wants to gain practical experience with big data tools, whether for an academic or industry role or for a dissertation or publication.
  • MSc or PhD students (e.g. in data science, health informatics or epidemiology) gaining skills in preparation for roles in academia or industry.
  • Clinical researchers who wish to practise big data tools in a controlled environment before deploying them on real datasets in trusted research environments.

Capacity
30 people

Agenda

  • 09:00 – 09:25: Welcome and goals
    – Objectives of the workshop
    – Platform walkthrough (console, Cloud Shell, APIs, budget/billing)
    – Datasets and use cases

  • 09:30 – 10:00: Cloud storage
    – Google Cloud Storage (see the sketch below)
    – Hadoop interface and usage
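
    A minimal, illustrative sketch of programmatic access to Google Cloud Storage with the Python client library (the project, bucket and file names are placeholders, not the workshop's actual materials):

      # Upload a local file to a GCS bucket and list its contents.
      from google.cloud import storage

      client = storage.Client(project="example-project")        # hypothetical project ID
      bucket = client.bucket("example-healthcare-bucket")        # hypothetical bucket name

      # Upload a local CSV file as an object in the bucket.
      bucket.blob("data/patients.csv").upload_from_filename("patients.csv")

      # List the objects stored under the data/ prefix.
      for blob in client.list_blobs("example-healthcare-bucket", prefix="data/"):
          print(blob.name, blob.size)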

  • 10:05 – 10:45: Spark in GCP
    – PySpark notebooks
    – Data exploration and preprocessing in Spark (see the sketch below)
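
    A minimal sketch of the kind of data exploration covered here, assuming a PySpark session on Dataproc (where the GCS connector makes gs:// paths readable) and a hypothetical patient CSV file:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("explore-patients").getOrCreate()

      # Read a CSV file directly from Cloud Storage (path is a placeholder).
      df = spark.read.csv("gs://example-healthcare-bucket/data/patients.csv",
                          header=True, inferSchema=True)

      df.printSchema()                        # column names and inferred types
      df.describe().show()                    # summary statistics for numeric columns
      df.groupBy("diagnosis").count().show()  # hypothetical column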

  • 10:45 – 10:55: Break

  • 11:00 – 12:00: Spark in GCP
    – Machine learning pipelines in Spark (MLlib estimators and transformers; see the sketch below)
    – Feature extractors and transformers
    – Classification use case
    – Prediction use case
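
    A minimal sketch of an MLlib pipeline for the classification use case, assuming a DataFrame df with a categorical label column and numeric feature columns (all column names are illustrative):

      from pyspark.ml import Pipeline
      from pyspark.ml.feature import StringIndexer, VectorAssembler
      from pyspark.ml.classification import LogisticRegression

      # Estimators and transformers are chained into a single Pipeline.
      indexer = StringIndexer(inputCol="diagnosis", outputCol="label")
      assembler = VectorAssembler(inputCols=["age", "heart_rate", "bmi"],
                                  outputCol="features")
      lr = LogisticRegression(featuresCol="features", labelCol="label")
      pipeline = Pipeline(stages=[indexer, assembler, lr])

      # Fit on a training split and score the held-out split.
      train, test = df.randomSplit([0.8, 0.2], seed=42)
      model = pipeline.fit(train)
      model.transform(test).select("label", "prediction", "probability").show(5)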

  • 12:00 – 12:30: Recap, Q&A, formative assignment
    – Storage, Dataproc and Spark APIs (DataFrames, SQL, MLlib)
    – Q&A and troubleshooting

  • 12:30 – 13:30: Lunch break

  • 13:30 – 13:45: Recap on Kafka architecture
    – Brokers, topics, producers, and consumers

  • 13:50 – 14:10: Kafka setup and use case
    – Kafka installation and setup
    – Use case: real-time patient monitoring
    – Discussion of the necessary infrastructure (topics, producers, and consumers)

  • 14:15 – 15:00: Producer and logging
    – Producer logic (see the sketch below)
    – Schema design for log storage
    – BigQuery fundamentals
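
    A minimal sketch of producer logic for the patient-monitoring use case, assuming the kafka-python client and a local broker (topic name, broker address and field names are illustrative):

      import json, random, time
      from kafka import KafkaProducer

      producer = KafkaProducer(
          bootstrap_servers="localhost:9092",                      # hypothetical broker
          value_serializer=lambda v: json.dumps(v).encode("utf-8"),
      )

      # Emit one synthetic vital-signs reading per second.
      while True:
          reading = {
              "patient_id": random.randint(1, 30),
              "heart_rate": random.randint(50, 120),
              "spo2": round(random.uniform(90.0, 100.0), 1),
              "timestamp": time.time(),
          }
          producer.send("patient-vitals", value=reading)           # hypothetical topic
          time.sleep(1)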

  • 15:00 – 15:10: Break

  • 15:15 – 16:00: Consumer and data visualisation
    – Consumer logic (see the sketch below)
    – Looker Studio fundamentals
    – BigQuery structured queries
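
    A minimal sketch of consumer logic that streams messages into BigQuery for dashboarding, assuming the kafka-python and google-cloud-bigquery clients and a pre-created destination table (all names are placeholders):

      import json
      from kafka import KafkaConsumer
      from google.cloud import bigquery

      consumer = KafkaConsumer(
          "patient-vitals",                                        # hypothetical topic
          bootstrap_servers="localhost:9092",
          value_deserializer=lambda v: json.loads(v.decode("utf-8")),
          auto_offset_reset="earliest",
      )
      bq = bigquery.Client()
      table_id = "example-project.monitoring.vitals"               # hypothetical table

      # Insert each consumed reading as a row via the streaming API.
      for message in consumer:
          errors = bq.insert_rows_json(table_id, [message.value])
          if errors:
              print("BigQuery insert errors:", errors)

    The resulting table can then be queried with standard SQL (for example, average heart rate per patient) and connected to Looker Studio as a data source.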

  • 16:05 – 16:30: Recap, Q&A and formative assignment
    – Streaming data processing
    – Kafka architecture
    – Q&A and troubleshooting

  • 16:35 – 17:00: Further training and supporting materials
    – Discussion of available resources and other sources of training
    – Community of practice and networking

  • 10:00 – 12:00: Drop-in session
    – Online drop-in session for technical questions on GCP and the covered use cases.

Register here

To register for this workshop, please click below:

Register Here