WS21: Which city is the cultural capital of Europe? An intro to Apache PySpark for Big Data GeoAnalysis


Trainers: Dr. Kashif Rasul, Shoaib Burq

Schedule: Monday Noon 14-18

Places: 30




In this workshop we will very quickly introduce you to the Apache Spark stack and then get into the meat of performing a full-featured geospatial analysis. Using OpenStreetMap data as our base, our end goal will be to find the most cultural city in Western Europe! That's right! We will develop our own Cultural Weight Algorithm (TM) ;) and apply it to a set of major cities in Europe. The data will be analyzed using Apache Spark, and in the process we will learn about the following phases of Big Data projects:

  • Consuming: Retrieving raw data from REST APIs (OpenStreetMap).
  • Preparation: Data exploration and schema creation.
  • Persistence: Storing the data for access on HDFS.
  • Feature engineering: Dimensionality reduction and feature detection.
  • Data modelling: Classification, supervised and unsupervised.
  • Data Visualization: Maps, charts, pretty pictures.
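
To give a feel for what a "cultural weight" might look like, here is a minimal sketch in plain Python. It is not the algorithm developed in the workshop: the tag weights, the record layout, the city names and the population figures are all made-up assumptions for illustration. The map step and the per-key sum mirror the shape such a job would typically take in Spark.

```python
from collections import defaultdict

# Hypothetical weights per kind of OpenStreetMap point of interest.
# These numbers are illustrative assumptions, not the workshop's algorithm.
CULTURAL_WEIGHTS = {
    "museum": 3.0,
    "theatre": 2.5,
    "gallery": 2.0,
    "library": 1.5,
    "cinema": 1.0,
}


def to_weighted_pair(poi):
    """Map step: turn one point of interest into a (city, weight) pair."""
    return poi["city"], CULTURAL_WEIGHTS.get(poi["kind"], 0.0)


def cultural_weight(pois, population):
    """Reduce step: sum weights per city, then scale per 100,000 residents."""
    totals = defaultdict(float)
    for city, weight in map(to_weighted_pair, pois):
        totals[city] += weight
    return {
        city: total * 100_000 / population[city]
        for city, total in totals.items()
    }


# Made-up sample records; real input would come from OpenStreetMap extracts.
sample_pois = [
    {"city": "A", "kind": "museum"},
    {"city": "A", "kind": "cinema"},
    {"city": "B", "kind": "theatre"},
    {"city": "B", "kind": "supermarket"},  # not cultural, weight 0.0
]
sample_population = {"A": 200_000, "B": 100_000}

scores = cultural_weight(sample_pois, sample_population)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Scaling by population is one possible design choice so that large cities do not win purely on size; other normalisations (by area, for example) would be equally reasonable.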

We hope you will join us on this journey of exploring one of the most exciting technology stacks to come out of the good folks at the Apache Software Foundation.

Tags: no-sql, big data, cloud computing, machine learning, data processing, data analysis



Additional Information:

Why Spark?

Spark has quickly overtaken Hadoop as the front runner in big data analysis technologies. There are a number of reasons for this, such as its support for a developer-friendly interactive mode, its polyglot interface in Scala, Java, Python, and R, and the full stack of algorithmic libraries that these language ecosystems offer.

Out of the box, Spark includes a powerful set of tools, such as the ability to write SQL queries, perform streaming analytics, run machine learning algorithms, and even tackle graph-parallel computations, but what really stands out is its usability.

With its interactive shells (in both Scala and Python), it makes prototyping big data applications a breeze.

Why PySpark?

PySpark provides integrated API bindings around Spark and enables full use of the Python ecosystem on all the nodes of the Spark cluster, using Python's pickle serialization. More importantly, it gives access to the rich ecosystem of Python’s machine learning libraries such as Scikit-Learn and data processing libraries such as Pandas.
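
As a rough illustration of the pickle round trip mentioned above, the following sketch uses only Python's standard library. It does not show Spark's internals: the records and the `name_length` function are hypothetical, and in a real cluster PySpark handles the serialization transparently.

```python
import pickle


def name_length(record):
    """Hypothetical per-record function a worker might apply."""
    return len(record["name"])


# Made-up batch of records standing in for one partition of data.
batch = [
    {"name": "Louvre", "kind": "museum"},
    {"name": "Prado", "kind": "museum"},
]

payload = pickle.dumps(batch)     # bytes that could be shipped to another process
restored = pickle.loads(payload)  # the receiver rebuilds equal Python objects
lengths = [name_length(r) for r in restored]
print(lengths)
```

The point is simply that ordinary Python objects survive the trip intact, which is why regular Python code and libraries can run on the worker side.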

During the workshop we are going to use a portable Python and Spark environment built using Docker containers.