Kazakhstani Startup Finds a Smarter Way to Collect Data for GenAI and Is Raising $3M

Tseren Andzhukayev, Nuraly Zhanbyrbayev, Dmitry Sandzhiyev, and Artem Ivanov figured out a way to quickly collect large amounts of data for training AI models, and went on to launch NCSpeech. The company is already profitable. In April, it also won Alem.ai Battle, Kazakhstan’s national AI competition. The team received the award from President Kassym-Jomart Tokayev himself and took home a 10 million tenge prize.

For the joint Digital Business and Astana Hub project, 100 Startup Stories from Central Eurasia, Tseren and Dmitry explained how the AI data market works, why companies need millions of real-life photos, videos, and audio recordings, and why taxis and delivery services are such a convenient way to collect them.

“Tech people often find it hard to juggle coding and B2B sales”

– What were you doing before founding NCSpeech?

Tseren: – I come from an academic background and have been working in machine learning and AI for a long time, including voice technologies. I did research at R&D centers for private companies and published academic papers, which now have more than 500 citations. I also worked with Artem Ivanov at Ayta AI, a startup founded by inDrive founder Arsen Tomsky. We were building a product for people who stutter.

Nuraly Zhanbyrbayev is also an engineer. He worked with major corporate clients in the banking sector and built voice services for companies.

Dmitry Sandzhiyev is the only team member without a technical background, but he brings a lot of entrepreneurial experience. For us tech people, it was tough at the beginning to juggle coding, B2B sales, and building relationships with corporations. When Dima joined with his business expertise, it made the team much stronger.

– What exactly is the startup building?

Tseren: – NCSpeech is a B2B platform that helps AI developers quickly collect and prepare datasets.

Why does that matter? Today, building algorithms and training models has become a solvable, almost routine task. The real value on the path to AGI, or artificial general intelligence, lies in high-quality training data and computing power. We want to focus on the first part: data.

Tseren Andzhukayev

– Where did the idea come from?

Tseren: – We started NCSpeech in March 2025. At first, we approached large companies with ready-made voice AI solutions, but in negotiations we kept hearing the same thing: training the model wasn’t the main problem. The much harder part was collecting, cleaning, and labeling high-quality data.

Last year, Nuraly explained the problem really well using banks as an example. Banks need closed AI solutions because, due to privacy policies, they can’t send customer data through APIs to OpenAI or Google. But even if they could, global models still don’t work that well with Kazakh, especially when people mix Kazakh and Russian. For a voice service to properly understand the local language, accents, dialects, and natural everyday speech, you first need a large volume of verified and labeled recordings.

Nuraly Zhanbyrbayev and Artem Ivanov

We collected that kind of data for Kazakh-language voice models, and later ran into similar challenges with Malay, Filipino, and Vietnamese. That’s how we gradually realized that our real value and expertise were in working with data.

Dmitry: – One good example is a case with a Malaysian bank. They wanted to add voice controls to their banking app. But because several languages are widely spoken in Malaysia, global models fail to understand what users are saying about half the time.

Dmitry Sandzhiyev

We collected the data, brought in local linguists to check the phrases, and trained a local model that could actually understand the bank’s customers. To make it work, we built our own multi-agent platform. It assigns tasks to contributors, collects the results, and checks the quality of the work. If the system sees that a task was done poorly, it sends it back to the contributor with comments.
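As a rough illustration of the collect, check, and send-back loop described above, here is a minimal Python sketch. Every name in it (`Submission`, `review`, `run_round`) and both quality checks are hypothetical and chosen only for the example; the interview does not describe how NCSpeech's platform is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """One contributor's result for one task (hypothetical structure)."""
    task_id: str
    contributor: str
    transcript: str        # text the contributor was asked to read
    audio_seconds: float   # length of the recording they sent back
    comments: list[str] = field(default_factory=list)


def review(sub: Submission, min_seconds: float = 2.0) -> str:
    """Return 'accepted' or 'returned'. The checks are made-up examples."""
    if sub.audio_seconds < min_seconds:
        sub.comments.append(
            f"Recording is shorter than {min_seconds} s; please re-record."
        )
    if not sub.transcript.strip():
        sub.comments.append("Transcript is empty.")
    # Any comment means the task goes back to the contributor.
    return "returned" if sub.comments else "accepted"


def run_round(submissions: list[Submission]):
    """Split one batch into accepted work and work sent back with comments."""
    accepted, returned = [], []
    for sub in submissions:
        if review(sub) == "accepted":
            accepted.append(sub)
        else:
            returned.append(sub)
    return accepted, returned


if __name__ == "__main__":
    batch = [
        Submission("t1", "driver_17", "The weather is sunny today.", 3.5),
        Submission("t2", "courier_04", "  ", 1.0),
    ]
    ok, back = run_round(batch)
    for sub in back:
        print(sub.task_id, "returned:", sub.comments)
```

In a real system the checks would presumably be model-based or done by linguists rather than fixed thresholds; the sketch only shows the shape of the loop, in which poor work is returned with comments rather than silently dropped.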

That story helped us see the bigger picture: there are plenty of similar problems across Southeast and Central Asia, Africa, the Middle East, and Latin America. These are regions with a huge number of languages and cultures that are still underrepresented in global AI datasets.

At the same time, public datasets online have pretty much been exhausted. Companies building self-driving cars, delivery robots, or smart cameras need photos and videos from real-life environments. For example, where do you find data on what a specific street in Shymkent actually looks like? You don’t just need random images from open sources. You need a proper dataset, thousands of fresh, up-to-date images.

Artem Ivanov

So with NCSpeech, we decided to focus on data collection. That was the idea the team took to Alem.ai Battle, the national AI project competition. In the startup category, NCSpeech won the top prize of 10 million tenge.

It feels like we’ve found product-market fit: our expertise lines up perfectly with what the market needs.

“We’re getting ready to launch a pilot with a major super app in Kazakhstan”

– How do you collect the data?

Dmitry: – Where do you find people who are out in real city environments and also have some free time? In the big apps people already use every day: taxi, delivery, marketplaces, fintech products. Often, these services are bundled into super apps, where several services are available inside one ecosystem.

Couriers and drivers spend 30-40% of their time waiting for an order, while passengers spend 15-20 minutes on their phones during a ride, scrolling through social media.

Our idea is to integrate NCSpeech into services like that. For example, when you get into a taxi, you receive a push notification: “Want 20% off this ride? Complete a couple of tasks.” The user might be asked to take a photo of their palm, record a short selfie video, or dictate a piece of text.

– Who needs this kind of data?

Dmitry: – Companies working in biometrics are willing to buy datasets with palm photos, at around one dollar per image. Banks and developers of remote customer verification tools need face videos. And global AI companies need massive amounts of data from different countries so they can adapt their algorithms to local markets, languages, and real-life use cases.

– Why would super apps want to integrate with NCSpeech?

Dmitry: – For a super app, it’s a new revenue stream. And the partner doesn’t have to spend its own marketing budget on it: the discounts and bonuses for users are covered by the customer, the company buying the collected photos, videos, or audio recordings. On top of that, the app can benefit from the AI boom in other ways too. For example, simply showing that it is integrating AI can help increase the company’s valuation.

We handle the entire technical side, from integrating the module to verifying the datasets.

– Are companies willing to build this kind of feature into their apps?

Dmitry: – There are already examples of this kind of collaboration globally. Late last year, Uber launched a project in the US and India where drivers were offered data collection and labeling tasks during downtime. In March this year, DoorDash, the largest delivery company in the US, launched data collection around the “last meter”: its couriers take photos of building or restaurant entrances.

We’re now getting ready to launch a pilot with a major super app in Kazakhstan. The focus will be on collecting video and audio materials. We’re also in talks with a major partner in Southeast Asia.

“We expect to hit several million dollars in ARR pretty quickly”

– What kinds of companies need dataset collection services?

Dmitry: – The first tier is the global giants, like Google, Meta, and Anthropic, that need massive amounts of data to adapt their models to local realities.

The second tier is companies and governments building sovereign AI. This also includes major banks that need to train models inside a closed corporate environment.

Finally, specific datasets are also needed by startups that have raised funding to build niche AI services.

Today, we have several clients in Kazakhstan and Southeast Asia. Most of them are banks, financial institutions, and AI labs within large holdings.

– How much does it cost?

Tseren: – Collecting data is expensive. Project budgets start at tens of thousands of dollars, which is basically the minimum where projects like this even make sense.

That’s why we expect to reach several million dollars in ARR fairly quickly. The global market for AI data collection and labeling is currently estimated at $3.8 billion, and forecasts suggest it will exceed $10 billion in the next few years.

– What are the limitations in this market?

Tseren: – We don’t work with anything illegal or with companies under sanctions.

One Chinese company once asked us to collect photos and videos of sleeping children under the age of two, probably to train algorithms for a smart baby monitor. But since we couldn’t verify that, we turned it down.

Dmitry: – We strictly follow personal data protection laws. First, we design the tasks in a way that avoids collecting sensitive personal information. For example, if we need a voice recording, we might ask users to talk about the weather rather than read out their passport number.

Second, if random passersby’s faces or license plates appear in a photo or video, they’re automatically blurred or removed before the data is sent to the client.

And overall, the whole data collection process is completely voluntary. Before completing a task, the user signs an agreement that clearly explains why the data is being collected. If someone isn’t comfortable with it, they can simply skip the task.

Nuraly Zhanbyrbayev

“We want to raise around $3M from a strong lead investor in the US or Singapore”

– What stage is the project at now?

Tseren: – We’re actively scaling now. The project is already generating revenue and paying for itself. Now we need to grow the team by hiring strong product people, engineers, and sales specialists.

– How much have you already invested in NCSpeech?

Tseren: – For the first six months, we funded the project entirely out of our own pockets. No one was taking a salary, even though each of our engineers would be worth at least $5,000 a month on the market. During that period, cloud infrastructure grants from AWS and Google Cloud really helped us. NCSpeech is also a resident of Astana Hub and receives tax benefits.

That summer, we raised $110,000 from Antler, a Singapore-based VC fund. And it wasn’t just about the money. We also got access to a strong network in Southeast Asia and, just as importantly, validation from a global player that the product really has value.

– Are you looking for investment right now?

Tseren: – As I mentioned, we’re already profitable operationally, but we need funding to move faster. We’re planning to raise around $3 million, and we’re looking for a strong lead investor from the US or Singapore who can bring more than just money, someone with real expertise and connections.

As engineers, we approach fundraising in a pretty systematic way. We look for founders who have already raised money from the funds we’re interested in, and then start building relationships with them. The goal is to get a warm intro to the investor. In Silicon Valley, that’s a reputation game. No one is going to recommend someone who clearly isn’t serious, because it could damage their own name. It’s not a quick process, and most likely, it will take in-person meetings too.

“It’s unfair that AI still doesn’t speak Kazakh well”

– Are there many competitors globally? And what makes you different?

Tseren: – There are big players in this market, like Scale AI, Toloka, Appen, and other companies already worth billions of dollars.

Our edge is in our approach and focus. First, we focus on markets where global models don’t work well. Second, we don’t spend money on user acquisition. Instead, we reach people through partner super apps and say: “You’re in a taxi, you have 15 minutes. Help us evaluate the quality of code or text in your field and get a bonus right away.” That way, we can instantly collect videos and photos from thousands of users.

– How do you see the project in the long run?

Tseren: – The big goal is to build a serious infrastructure player in data for emerging markets.

It’s not fair that AI models can understand English so well, but still struggle with languages like Kazakh or Malay. We want to become the go-to platform that companies and governments turn to when they need data to train their AI models.