ACM Multimedia 2015 Workshop

Speech, Language and Audio in Multimedia

Speech, language and audio meet computer vision

SLAM'15 Preliminary Program

9:00

Workshop introduction

9:15 - 10:15


David Dean, Queensland University of Technology

SAIVT-BNEWS: An Australian broadcast news video dataset for entity extraction, and more

10:15 - 10:45

Coffee break

10:45 - 12:45

Morning Session

10:45

Predicting music popularity patterns based on musical complexity and early stage popularity
Junghyuk Lee and Jong-Seok Lee

11:15

SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents
Damiano Spina, Johanne R. Trippas, Lawrence Cavedon and Mark Sanderson

11:45

Acoustic adaptation in cross database audio visual SHMM training for phonetic spoken term detection
Shahram Kalantari, David Dean, Sridha Sridharan, Houman Ghaemmaghami and Clinton Fookes

12:15

Evaluation Data, Benchmarks, and Activities for Cascaded Speech Recognition and Extraction of 35 Entities: Content Capturing, Segmentation, and Structuring of Verbal Clinical Handover
Liyuan Zhou, Hanna Suominen and Leif Hanlen


Score Propagation based on Similarity Shot Graph for Improving Visual Object Retrieval
Juan Manuel Barrios and Jose M. Saavedra



12:45 - 14:00

Lunch break



14:00 - 16:00

Hyperlinking session: Vision meets speech and language

14:00

Convenient Discovery of Archived Video Using Audiovisual Hyperlinking
Roeland Ordelman, Robin Aly, Maria Eskevich, Benoît Huet and Gareth Jones

14:30

Audio Information for Hyperlinking of TV content
Petra Galuščáková and Pavel Pecina

15:00

Hierarchical topic models for language-based video hyperlinking
Anca-Roxana Simon, Guillaume Gravier, Pascale Sébillot, Rémi Bois, Emmanuel Morin and Sien Moens

15:30

Exploring Video Hyperlinking in Broadcast Media
Maria Eskevich, Quoc-Minh Bui, Hoang-An and Benoît Huet

16:00 - 17:00

Round table discussion


SAIVT-BNEWS: An Australian broadcast news video dataset for entity extraction, and more
David Dean, Queensland University of Technology, Australia

QUT has recently released a set of annotated broadcast news videos (SAIVT-BNEWS), available from our website. This presentation will outline the dataset itself, which covers 50 or so short news clips surrounding a single political event, with many entities appearing in multiple records, and will cover research that QUT has performed, is currently performing, and is interested in performing on this dataset in the future. It will cover existing published research, including image processing tasks such as face detection, face recognition and face clustering, and speech processing tasks (including the use of visual speech) such as speech detection, speaker recognition and speaker diarisation. We have also started research on fusing multiple sources of information, including metadata, OCR, faces, speech and scene detection, to improve the performance of many techniques, with a focus on improving the automatic extraction of entities (people, places, companies and organisations) from large volumes of audio-visual data; this will also be covered. As this dataset is publicly available for free to all researchers, QUT hopes that other researchers will also be able to make use of, and improve upon, this dataset.

Dr David Dean is a Senior Research Fellow at the Queensland University of Technology who has published extensively across a wide range of audio and visual speech processing areas, with a focus on speaker diarisation, verification and keyword spotting across multimedia archives. Since completing his PhD in 2008, on Synchronous HMMs for Audio-Visual Speech Processing, Dr Dean has worked on a wide range of research projects funded by industry, the ARC and CRCs, and has assisted four PhD research programs to completion.


The SLAM workshop series is organized by the Special Interest Group on Speech and Language in Multimedia of the International Speech Communication Association, with support from the IEEE SIG on Audio and Speech Processing in Multimedia.

