interesting (recent) reads & listens

Venkat Ramaraju

engineer @ tabapay working on all things payments.

interested in cross-lingual llms.

experience

education

projects

some large efforts, some weekend hacking projects

  • polydb: a vector database written from scratch in go

    trained an embedding model from scratch via sgns + pytorch. apiserver communicates with vector services via grpc. more training runs in progress.

    at some point, i may completely overhaul the model and train it to map semantically similar sentences from different languages to nearby points in a shared vector space. this would allow users to search documents in whatever language they like.
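
    a minimal sketch of the lookup a vector database performs, assuming brute-force cosine similarity over an in-memory index. the `hit` type, the `topK` function, and the sample vectors are hypothetical and are not taken from polydb, which may well use an approximate index instead.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// hit pairs a stored vector's id with its similarity to the query.
type hit struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity of two equal-length vectors.
// It returns 0 when either vector has zero norm.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scans every vector in index and returns the k most similar to query.
// Ties are broken by id so the ordering is deterministic.
func topK(index map[string][]float64, query []float64, k int) []hit {
	hits := make([]hit, 0, len(index))
	for id, vec := range index {
		hits = append(hits, hit{ID: id, Score: cosine(query, vec)})
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Score != hits[j].Score {
			return hits[i].Score > hits[j].Score
		}
		return hits[i].ID < hits[j].ID
	})
	if k > len(hits) {
		k = len(hits)
	}
	return hits[:k]
}

func main() {
	// Made-up 3-dimensional embeddings, for illustration only.
	index := map[string][]float64{
		"doc-en": {1, 0, 0},
		"doc-hi": {0.9, 0.1, 0},
		"doc-xx": {0, 0, 1},
	}
	for _, h := range topK(index, []float64{1, 0, 0}, 2) {
		fmt.Printf("%s %.3f\n", h.ID, h.Score)
	}
}
```

    with a cross-lingual embedding model, a query in one language and a document in another would ideally land close together, so the same scan would surface both.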

  • flowcast: an xgboost model that predicts 15-minute net bike flow for lyft bike stations in the bay area

    1. trained an xgboost model on 4 years of lyft bike rides to predict net bike flow throughout the day for each station based on weather, day/time and other signals.
    2. achieved an MAE of 1.07 on the validation set.
    3. built a fullstack app (fastapi + react) to interactively run model inference.
    4. next: expand the feature set, perhaps with information about ongoing events near each station.
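
    a minimal sketch of how the prediction target and the error metric could be computed, assuming net flow means arrivals minus departures per station per 15-minute window (the exact definition is not spelled out above). the `ride` and `bucket` types and the `clock` helper are hypothetical, and the real pipeline presumably lives in python alongside xgboost.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// ride is a hypothetical trip record: where and when a bike left and arrived.
type ride struct {
	StartStation, EndStation string
	StartedAt, EndedAt       time.Time
}

// bucket identifies one station in one 15-minute window (window start, unix seconds).
type bucket struct {
	Station string
	Start   int64
}

// netFlow counts arrivals minus departures per station and 15-minute window.
func netFlow(rides []ride) map[bucket]int {
	flow := make(map[bucket]int)
	for _, r := range rides {
		dep := r.StartedAt.Truncate(15 * time.Minute).Unix()
		arr := r.EndedAt.Truncate(15 * time.Minute).Unix()
		flow[bucket{r.StartStation, dep}]-- // a bike left this station
		flow[bucket{r.EndStation, arr}]++   // a bike arrived at this station
	}
	return flow
}

// mae is the mean absolute error between predictions and targets.
// It assumes both slices have the same length and returns 0 for empty input.
func mae(pred, actual []float64) float64 {
	if len(pred) == 0 {
		return 0
	}
	var sum float64
	for i := range pred {
		sum += math.Abs(pred[i] - actual[i])
	}
	return sum / float64(len(pred))
}

// clock is a helper for building example timestamps on an arbitrary day.
func clock(hour, min int) time.Time {
	return time.Date(2024, 1, 1, hour, min, 0, 0, time.UTC)
}

func main() {
	rides := []ride{
		{"A", "B", clock(8, 2), clock(8, 12)},
		{"A", "B", clock(8, 5), clock(8, 20)},
		{"B", "A", clock(8, 16), clock(8, 29)},
	}
	flow := netFlow(rides)
	fmt.Println("A 08:00:", flow[bucket{"A", clock(8, 0).Unix()}])
	fmt.Println("A 08:15:", flow[bucket{"A", clock(8, 15).Unix()}])
	fmt.Println("mae:", mae([]float64{1, 2}, []float64{2, 4}))
}
```

    under this reading, an MAE of 1.07 would mean predictions are off by about one bike per station per window on average.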

  • polyglot: a multilingual tokenizer implemented from scratch in go via the byte-pair encoding algorithm

    achieves uniform compression and fertility across 10 diverse scripts. training to reach 5.0 compression is in progress.
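
    a minimal sketch of byte-level bpe training and the two metrics mentioned above. it assumes compression means bytes per token and fertility means tokens per whitespace-separated word, which are common definitions but are not confirmed here. all names (`pair`, `train`, `metrics`) are hypothetical and not taken from polyglot, and whitespace pre-splitting is a simplification that does not suit scripts written without spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// pair is two adjacent token ids.
type pair struct{ a, b int }

// mostFrequentPair counts adjacent pairs across all words and returns the
// most common one with its count. Ties go to the smaller ids, so the result
// does not depend on map iteration order.
func mostFrequentPair(words [][]int) (pair, int) {
	counts := make(map[pair]int)
	for _, w := range words {
		for i := 0; i+1 < len(w); i++ {
			counts[pair{w[i], w[i+1]}]++
		}
	}
	var best pair
	bestN := 0
	for p, n := range counts {
		smaller := p.a < best.a || (p.a == best.a && p.b < best.b)
		if n > bestN || (n == bestN && smaller) {
			best, bestN = p, n
		}
	}
	return best, bestN
}

// merge replaces every non-overlapping occurrence of p in w with id.
func merge(w []int, p pair, id int) []int {
	out := make([]int, 0, len(w))
	for i := 0; i < len(w); i++ {
		if i+1 < len(w) && w[i] == p.a && w[i+1] == p.b {
			out = append(out, id)
			i++ // skip the second half of the merged pair
		} else {
			out = append(out, w[i])
		}
	}
	return out
}

// train starts from raw bytes (ids 0-255) and learns up to numMerges merges.
// It stops early once no pair occurs at least twice.
func train(text string, numMerges int) (words [][]int, merges []pair) {
	for _, f := range strings.Fields(text) {
		w := make([]int, len(f))
		for i := 0; i < len(f); i++ {
			w[i] = int(f[i])
		}
		words = append(words, w)
	}
	for m := 0; m < numMerges; m++ {
		p, n := mostFrequentPair(words)
		if n < 2 {
			break
		}
		for i := range words {
			words[i] = merge(words[i], p, 256+m)
		}
		merges = append(merges, p)
	}
	return words, merges
}

// metrics returns bytes per token (compression) and tokens per word (fertility).
func metrics(text string, words [][]int) (compression, fertility float64) {
	nBytes, nTokens := 0, 0
	for _, f := range strings.Fields(text) {
		nBytes += len(f)
	}
	for _, w := range words {
		nTokens += len(w)
	}
	if nTokens == 0 || len(words) == 0 {
		return 0, 0
	}
	return float64(nBytes) / float64(nTokens), float64(nTokens) / float64(len(words))
}

func main() {
	text := "low low low lower"
	words, merges := train(text, 10)
	c, f := metrics(text, words)
	fmt.Printf("merges=%d compression=%.2f fertility=%.2f\n", len(merges), c, f)
}
```

    comparing these two numbers per script is one way to check whether a tokenizer treats languages evenly: a script that needs many more tokens per word is paying a higher cost for the same content.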

  • venkbot: a personal agentic toolkit to automate mundane tasks in my life

    whatsapp chat (twilio) hits my self-hosted server; an llm-backed dispatcher invokes the right mix of bespoke tools + mcps to perform the task.

  • dataquest.ai: an authenticated ai natural language querying tool for documents, datasets, videos, emails, etc.

    this application implements rate limiting, request caching, and connection pooling from scratch.

    leverages pinecone, langchain with gpt-3.5 turbo, gmail api, youtube transcription api, and stripe api integrations.

  • whaletracker: realtime whale trade tracker with polymarket websockets

    dynamically discovers and subscribes to new markets, computes a spread asymmetry pressure metric, stores data in redis, and sends email alerts on threshold breach.

  • fb-finetuned: finetuned gpt-oss-20b on 1.5 yrs of my facebook texts to learn my texting style

    1. built a dataset generation agent with langgraph + ollama (llama3.2)
    2. finetuned gpt-oss-20b with peft + lora (shoutout unsloth!)
    3. will create a stt pipeline soon with the finetuned model

  • zillow-bot: a bot that emails you new weekly zillow postings based on your search criteria

    1. rapid api for access to zillow data
    2. s3 to store reports
    3. weekly cronjob set up with github actions for workflow automation

  • interspersed bilingual decoder (ibd): a code-mixing decoder-only model fine-tuned on top of Llama-3.1

    generated hinglish code-mixed datasets using pos tagging with stanza/spacy, performed sft + dpo for alignment.

  • agora: stock recommender based on public sentiment

    uses VADER sentiment analysis models, daily web scrapers using selenium, yfinance api, xgboost and random forest ensemble models

    big shoutout to Dr. Ajay Bansal and his PhD student James Smith for their support in elevating this project.

papers