About

Semantic search over any of the interviews from Tate (since release). Transcripts may not be perfect (blame the YouTube API's stringent ban on non-OAuth caption access lol). This project uses basic Python scripting, a vector database, and semantic KNN search.

YouTube V3 API - Fetches and processes videos from YouTube to use as transcript backend powering semantic search.
Milvus.io / Zilliz - vector DB backend storing video transcript data and powering semantic search for the frontend.
OpenAI's text-embedding-ada-002 - used in conjunction with the vector DB. Gives the client more tools than basic keyword search. Read more on the k-nearest-neighbor (KNN) algorithm.
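
To make the KNN idea concrete, here is a minimal brute-force sketch in plain Python. It is not how Milvus does it internally (a vector DB uses indexes to avoid comparing against every vector), and the function names `cosine_similarity` and `knn_search` are made up for illustration. It assumes cosine similarity as the distance measure, which is a common choice for text embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "not similar to anything"
    return dot / (norm_a * norm_b)


def knn_search(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k stored vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    # Highest similarity first, then keep only the top k.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real ones would come from the embedding model.
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = knn_search([1.0, 0.1], stored, k=2)
print([index for index, _ in top])
```

In the real project the query vector would be the embedding of the search text and the stored vectors would be the transcript chunk embeddings; the ranking logic is the same idea at a larger scale.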

Videos are transcribed, combined with associated metadata, and pre-processed. The transcripts are chunked by token count, converted to text embeddings with 1,536 dimensions (the output size of text-embedding-ada-002), and stored in the vector database. There are limitations; for those who care more about this topic, read the Milvus documentation.
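
The chunking step can be sketched roughly like this. It is an illustration under assumptions, not the project's actual code: it splits on whitespace as a stand-in for a real tokenizer, and the names `chunk_tokens`, `chunk_transcript`, `video_id`, `chunk_size`, and `overlap` are hypothetical. The embedding and database calls are left out on purpose.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of chunk_size, sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the transcript
    return chunks


def chunk_transcript(text: str, video_id: str, chunk_size: int = 200, overlap: int = 20) -> List[Dict]:
    """Attach (hypothetical) metadata to each chunk so a hit can point back to its video."""
    tokens = text.split()  # assumption: whitespace split instead of a model tokenizer
    return [
        {"video_id": video_id, "chunk_index": i, "text": " ".join(chunk)}
        for i, chunk in enumerate(chunk_tokens(tokens, chunk_size, overlap))
    ]


records = chunk_transcript("a b c d e f g", video_id="example-id", chunk_size=3, overlap=1)
for record in records:
    print(record["chunk_index"], record["text"])
```

A small overlap between neighboring chunks is a common trick so that a sentence cut at a chunk boundary still appears whole in at least one chunk. Each record's text would then be sent to the embedding model and the resulting vector stored alongside the metadata.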

Next Steps & Feedback

Some of my plans to improve this project:

Moving away from the YouTube V3 API towards a faster transcribing solution. Whisper is good but expensive, and pytube and other Python packages are probably going to be used once the amount of video content exceeds a certain storage capacity.
Adding visual elements to the search experience (e.g. thumbnail generation specific to the exact timestamp) using Puppeteer or some other solution.

Feel free to send me feedback on Twitter.

Notice & License

Follow me on Twitter @vdutts7 for more content like this.
Support my open source work by sponsoring me before my API costs explode.
Independently created. Not affiliated with Andrew Tate, YouTube, or any of the companies mentioned above.

Tate Pods/Interviews ~ AI
