TCSE: Entity Search and 6,400 Talks

I have been working on a round of updates to TCSE (TED Corpus Search Engine), a search tool for TED Talk transcripts with translations in 34 languages. It started as a teaching and research tool for corpus linguistics and now covers more than 6,400 searchable talks.

This update focused on three areas: search capabilities, performance, and reducing external dependencies.

Search

Named entity recognition is now integrated into the search. You can search for patterns like %PERSON said or %ORG announced, and combine entity types with POS filters.
A new collocation network visualization shows word associations as an interactive force-directed graph. Collocation statistics include mutual information, t-score, and difference of proportions.
Construction search has been reorganized, with NER-based patterns added alongside the existing grammatical pattern categories.
A KWIC (Key Word In Context) concordance view is now available as an alternative display mode.
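
For readers unfamiliar with the collocation statistics mentioned above, here is a minimal sketch of two of them, using the textbook corpus-linguistics definitions of mutual information and t-score. It is not TCSE's actual code: the function names, the span parameter, and the counts are all assumptions made for illustration, and difference of proportions is left out because its exact formulation varies between tools.

```python
import math


def expected_cooccurrence(node_freq, collocate_freq, corpus_size, span=1):
    """Co-occurrence count expected if the two words were independent.

    `span` is the number of token positions searched around the node
    (an illustrative parameter, not taken from TCSE).
    """
    return node_freq * collocate_freq * span / corpus_size


def mutual_information(observed, expected):
    """MI = log2(O / E). High values favor rare but exclusive pairings."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must be positive")
    return math.log2(observed / expected)


def t_score(observed, expected):
    """t = (O - E) / sqrt(O). High values favor frequent pairings."""
    if observed <= 0:
        raise ValueError("observed must be positive")
    return (observed - expected) / math.sqrt(observed)


# Hypothetical counts: the node occurs 1,000 times, the collocate 500 times,
# and they co-occur 30 times in a corpus of 1,000,000 tokens.
observed = 30
expected = expected_cooccurrence(1000, 500, 1_000_000)

mi = mutual_information(observed, expected)
t = t_score(observed, expected)
print(f"expected={expected:.2f}  MI={mi:.2f}  t={t:.2f}")
```

The two measures rank collocates differently, which is one reason to report more than one: MI rewards pairs that occur far more often than chance even when raw counts are small, while the t-score rewards pairs backed by a lot of evidence.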

Performance

Collocation network queries went from 31 database calls to 2 batch queries, reducing response time from around 10 seconds to 1-2 seconds.
Composite database indexes on the token table improved advanced search speed.
Talk search now runs full-text search on pre-computed tsvector values instead of aggregating keywords on the fly.
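
The batching idea behind the collocation network speedup can be sketched roughly as follows. This is an illustration under assumptions, not the TCSE implementation: the `collocation` table, its columns, and both function names are hypothetical, and SQLite is swapped in only so the example runs on its own. The first function issues one query per word, which is the pattern that adds up to dozens of calls. The second fetches the same rows in a single query with an IN clause and groups them in Python.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collocation (node TEXT, collocate TEXT, freq INTEGER)"
)
conn.executemany(
    "INSERT INTO collocation VALUES (?, ?, ?)",
    [
        ("idea", "big", 12),
        ("idea", "new", 30),
        ("big", "data", 25),
        ("new", "way", 18),
    ],
)


def neighbors_one_by_one(conn, words):
    """One query per word: N database calls for N words."""
    result = {}
    for word in words:
        result[word] = conn.execute(
            "SELECT collocate, freq FROM collocation "
            "WHERE node = ? ORDER BY freq DESC",
            (word,),
        ).fetchall()
    return result


def neighbors_batched(conn, words):
    """One query for all words, grouped by node afterwards."""
    if not words:
        return {}
    placeholders = ", ".join("?" for _ in words)
    rows = conn.execute(
        "SELECT node, collocate, freq FROM collocation "
        f"WHERE node IN ({placeholders}) ORDER BY node, freq DESC",
        list(words),
    ).fetchall()
    grouped = defaultdict(list)
    for node, collocate, freq in rows:
        grouped[node].append((collocate, freq))
    # Keep an entry for every requested word, even if it has no rows.
    return {word: grouped.get(word, []) for word in words}


words = ["idea", "big", "new"]
print(neighbors_batched(conn, words))
```

Both functions return the same mapping. The batched version trades a slightly larger result set and a grouping step for far fewer round trips, which usually dominates when each call carries network and planning overhead.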

Infrastructure

All JavaScript and CSS dependencies (jQuery, Bootstrap, D3.js, hls.js) are now bundled locally instead of loaded from CDNs.
The UI now supports four languages: English, Japanese, Chinese, and Korean.

Tomorrow I am presenting a study on English adjective subjectivity and construction selection at the JAECS Lexicography SIG workshop, and some of the corpus data behind that study comes from TCSE.