Generating Test Data for 10,000 EF Core Records Without Losing Your Mind

Why realistic test data matters

Our e-commerce platform had a product search endpoint that worked perfectly in development. The integration tests passed with 50 products. The QA environment had about 300 products. The performance tests ran fine. Then a client gave us access to their production data for a proof-of-concept migration: 11,400 products across 280 categories, with realistic names, descriptions, and pricing. The search endpoint that returned in 45ms with 300 products took 8 seconds with 11,400.

The query plan was the culprit: a missing index that didn't matter at low data volumes became critical at real-world scale. We had spent months building and testing this feature, and we only found the performance problem because a client handed us realistic data. That's backwards. I decided we needed a way to generate realistic, production-scale data ourselves, before our clients discovered the problems for us.

The manual approach that doesn't scale

// What many teams do — generates unrealistic data and doesn't scale
public static void SeedTestData(AppDbContext context)
{
    for (int i = 0; i < 10000; i++)
    {
        context.Products.Add(new Product
        {
            Name        = $"Product {i}",           // not realistic
            Sku         = $"SKU-{i:D6}",
            Price       = 9.99m,                    // same price every time
            Description = "Test product description", // identical for all records
            CategoryId  = (i % 10) + 1,             // perfectly even distribution
            CreatedAt   = DateTime.UtcNow            // all same timestamp
        });
    }
    context.SaveChanges();
}

This generates records, but they're useless for finding real bugs. The data distribution is perfectly uniform, the prices are identical, the names are sequential strings, and all records have the same creation timestamp. Real production data has skewed distributions (most products in a few categories), varied prices (from $0.99 to $4,999), names with special characters and multiple words, and timestamps spread over months. Fake data must be realistic to be useful.

Setting up Bogus

// Install via NuGet: dotnet add package Bogus
// Bogus, by Brian Chavez, is a .NET port of faker.js
using System;
using System.Collections.Generic;
using Bogus;

// Define a Faker<T> rule set for each entity type
public static class ProductSeeder
{
    // Fixed seed for reproducible data across runs: same seed, same rules, same data
    private const int Seed = 12345;

    // Fixed reference date: a range anchored to DateTime.UtcNow would shift on
    // every run and defeat the fixed seed (the date itself is an arbitrary choice)
    private static readonly DateTime ReferenceDate =
        new DateTime(2024, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    public static Faker<Product> CreateProductFaker(IReadOnlyList<int> categoryIds)
    {
        return new Faker<Product>()
            .RuleFor(p => p.Name, f =>
                f.Commerce.ProductName())                    // "Handmade Cotton Shirt"
            .RuleFor(p => p.Sku, f =>
                f.Commerce.Ean13())                          // "4006381333931"
            .RuleFor(p => p.Price, f =>
                Math.Round(f.Random.Decimal(0.99m, 4999.99m), 2)) // realistic range
            .RuleFor(p => p.Description, f =>
                f.Commerce.ProductDescription())             // generated text
            .RuleFor(p => p.CategoryId, f =>
                f.PickRandom<int>(categoryIds))              // uniform pick, not weighted;
                                                             // <int> keeps overload resolution unambiguous
            .RuleFor(p => p.IsActive, f =>
                f.Random.WeightedRandom(
                    new[] { true, false },
                    new[] { 0.85f, 0.15f }))                 // 85% active, 15% inactive
            .RuleFor(p => p.CreatedAt, f =>
                f.Date.Between(                              // spread over 2 years
                    ReferenceDate.AddYears(-2),
                    ReferenceDate).ToUniversalTime())
            .RuleFor(p => p.StockQuantity, f =>
                f.Random.Int(0, 500))
            .UseSeed(Seed);
    }
}
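
The rule set above picks categories uniformly, while real catalogs tend to concentrate products in a few categories. One way to get a skewed distribution is to pass normalized weights to f.Random.WeightedRandom, the same Bogus call already used for IsActive. The sketch below builds Zipf-like weights using only the base class library. CategoryWeights and BuildZipfWeights are hypothetical names, and the default exponent of 1.0 is an assumption, not a value measured from real data.

```csharp
using System;
using System.Linq;

public static class CategoryWeights
{
    // Returns weights proportional to 1 / rank^exponent, normalized to sum to 1.
    // Index 0 is the most popular category.
    public static float[] BuildZipfWeights(int count, double exponent = 1.0)
    {
        if (count <= 0) throw new ArgumentOutOfRangeException(nameof(count));

        var raw = Enumerable.Range(1, count)
            .Select(rank => 1.0 / Math.Pow(rank, exponent))
            .ToArray();
        var total = raw.Sum();

        return raw.Select(w => (float)(w / total)).ToArray();
    }
}
```

With the weights computed once outside the rule, the CategoryId rule could become f.Random.WeightedRandom(categoryIds.ToArray(), weights), so the first few category IDs receive most of the products.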

Bulk seeding with EF Core efficiently

// Seed 10,000 products into a test database efficiently
using System;
using System.Linq;
using System.Threading.Tasks;
using Bogus;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;

public class DatabaseSeeder
{
    private readonly AppDbContext _context;
    private readonly ILogger<DatabaseSeeder> _logger;

    public DatabaseSeeder(AppDbContext context, ILogger<DatabaseSeeder> logger)
    {
        _context = context;
        _logger  = logger;
    }

    public async Task SeedAsync(int productCount = 10_000)
    {
        if (await _context.Products.AnyAsync())
        {
            _logger.LogInformation("Database already seeded, skipping.");
            return;
        }

        // First, seed categories (products depend on them).
        // Department() can repeat and may contain '&' or ','; adjust the rules
        // if Name or Slug must be unique or URL-safe in your schema.
        var categoryFaker = new Faker<Category>()
            .RuleFor(c => c.Name, f => f.Commerce.Department())
            .RuleFor(c => c.Slug, (f, c) => c.Name.ToLowerInvariant().Replace(' ', '-'))
            .UseSeed(999);

        var categories = categoryFaker.Generate(50);
        // AddRange is sufficient here; AddRangeAsync exists only for special
        // value generators that need database access while adding
        _context.Categories.AddRange(categories);
        await _context.SaveChangesAsync(); // commit to get database-assigned IDs

        var categoryIds = await _context.Categories.Select(c => c.Id).ToListAsync();

        // Generate and batch-insert products
        var productFaker = ProductSeeder.CreateProductFaker(categoryIds);
        const int batchSize = 500;
        var totalInserted = 0;

        while (totalInserted < productCount)
        {
            var batch = productFaker.Generate(Math.Min(batchSize, productCount - totalInserted));
            _context.Products.AddRange(batch);
            await _context.SaveChangesAsync();

            // EF Core 5+: release saved entities so the change tracker stays small
            _context.ChangeTracker.Clear();

            totalInserted += batch.Count;
            _logger.LogInformation("Seeded {Total}/{Target} products", totalInserted, productCount);
        }

        _logger.LogInformation("Seeding complete: {Count} products across {Categories} categories",
            totalInserted, categories.Count);
    }
}

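Adding entities in batches of 500 and calling SaveChangesAsync once per batch matters for large datasets. Calling SaveChangesAsync once per record costs at least one round-trip per record, so 10,000 of them. Adding all 10,000 records before a single SaveChangesAsync call puts every entity into EF Core's change tracker at once, which can consume significant memory and slow change detection. Batching only keeps the tracker small if saved entities are released between batches, for example with ChangeTracker.Clear() (EF Core 5 and later) or a fresh DbContext per batch. Batches of 250-1,000 records are a reasonable starting point for balancing memory use against round-trips.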
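
The batching logic can also be separated from EF Core, which makes the round-trip count easy to reason about and to test. The following is a minimal sketch, assuming .NET 6 or later for Enumerable.Chunk. BatchInsert, RunAsync, and the saveBatch delegate are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BatchInsert
{
    // Splits the source into batches and awaits saveBatch once per batch.
    // Returns the number of batches, i.e. the number of save round-trips.
    public static async Task<int> RunAsync<T>(
        IEnumerable<T> source,
        int batchSize,
        Func<IReadOnlyList<T>, Task> saveBatch)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batches = 0;
        foreach (var batch in source.Chunk(batchSize)) // Enumerable.Chunk needs .NET 6+
        {
            await saveBatch(batch);
            batches++;
        }
        return batches;
    }
}
```

In a seeder, saveBatch would typically add the batch to the DbSet, await SaveChangesAsync, and then clear the change tracker. For 10,000 records and a batch size of 500, the helper calls saveBatch 20 times.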

Prototyping the schema with the Fake Data Generator

Before writing the Bogus configuration, I use the Fake Data Generator to prototype what each field should look like. This helps me answer questions like: should the SKU be numeric or alphanumeric? What's a realistic maximum length for a product description? What format should phone numbers be in for this client's market?

The Fake Data Generator lets me generate sample values for each field type quickly, without writing and running code. I check that the generated values look realistic, then translate each field type to the corresponding Bogus rule. For example, if I need phone numbers in Indian format, I try the pattern +91 ########## in the Fake Data Generator, verify the output, and then pass the same format string to Bogus's Phone dataset: f.Phone.PhoneNumber("+91 ##########").
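
The format strings in this workflow use '#' as a digit placeholder. As a rough illustration of what such a pattern expands to, the sketch below fills a pattern with random digits. It is a mimic written for this article, not Bogus's actual implementation, and PatternPreview.Fill is a hypothetical name.

```csharp
using System;
using System.Text;

public static class PatternPreview
{
    // Replaces each '#' with a random digit and copies every other character as-is.
    public static string Fill(string pattern, Random rng)
    {
        var sb = new StringBuilder(pattern.Length);
        foreach (var ch in pattern)
        {
            sb.Append(ch == '#' ? (char)('0' + rng.Next(10)) : ch);
        }
        return sb.ToString();
    }
}
```

Calling PatternPreview.Fill("+91 ##########", new Random(1)) yields a 14-character string that keeps the "+91 " prefix and replaces the ten placeholders with digits, which is enough to check column widths and validation rules before committing to a format.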

Seeding in integration tests with a fixed schema

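// Integration test that uses seeded data for performance validation
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc.Testing;
using Microsoft.Data.Sqlite;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using Xunit;

public class ProductSearchPerformanceTests
    : IClassFixture<WebApplicationFactory<Program>>, IDisposable
{
    private readonly WebApplicationFactory<Program> _factory;
    private readonly SqliteConnection _connection;

    public ProductSearchPerformanceTests(WebApplicationFactory<Program> factory)
    {
        // An in-memory SQLite database exists only while its connection is open,
        // so keep one connection open for the lifetime of the test class
        _connection = new SqliteConnection("DataSource=:memory:");
        _connection.Open();

        _factory = factory.WithWebHostBuilder(builder =>
        {
            builder.ConfigureServices(services =>
            {
                // Remove the app's own registration; a second AddDbContext call
                // does not replace options that are already registered
                var existing = services.SingleOrDefault(d =>
                    d.ServiceType == typeof(DbContextOptions<AppDbContext>));
                if (existing != null) services.Remove(existing);

                // SQLite's query planner differs from a production engine's, so
                // treat this as a smoke test rather than a representative benchmark
                services.AddDbContext<AppDbContext>(opts => opts.UseSqlite(_connection));
                services.AddScoped<DatabaseSeeder>(); // in case the app does not register it
            });
        });
    }

    [Fact]
    public async Task ProductSearch_Returns_Under_200ms_With_10000_Products()
    {
        // Arrange: seed the test database with realistic data
        using var scope = _factory.Services.CreateScope();
        var db = scope.ServiceProvider.GetRequiredService<AppDbContext>();
        await db.Database.EnsureCreatedAsync();
        var seeder = scope.ServiceProvider.GetRequiredService<DatabaseSeeder>();
        await seeder.SeedAsync(productCount: 10_000);

        var client = _factory.CreateClient();
        const string url = "/api/products/search?q=shirt&category=5";

        // Warm-up request so JIT compilation and EF Core model building are not measured
        (await client.GetAsync(url)).EnsureSuccessStatusCode();

        // Act: measure search response time
        var sw = Stopwatch.StartNew();
        var response = await client.GetAsync(url);
        sw.Stop();

        // Assert
        response.EnsureSuccessStatusCode();
        Assert.True(sw.ElapsedMilliseconds < 200,
            $"Search took {sw.ElapsedMilliseconds}ms; expected under 200ms");
    }

    public void Dispose()
    {
        _factory.Dispose();
        _connection.Dispose();
    }
}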
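
A single stopwatch reading is sensitive to noise from the test host, so a threshold assertion like the one above can fail intermittently. One common mitigation is to time several runs and assert on the median. The helper below is a minimal sketch of that idea. Timing and MedianMillisecondsAsync are hypothetical names, and the default of one warm-up and five timed runs is an arbitrary choice.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class Timing
{
    // Runs the action `warmups` times untimed, then `runs` times timed,
    // and returns the median elapsed time in milliseconds.
    public static async Task<double> MedianMillisecondsAsync(
        Func<Task> action, int warmups = 1, int runs = 5)
    {
        if (runs <= 0) throw new ArgumentOutOfRangeException(nameof(runs));

        for (var i = 0; i < warmups; i++) await action();

        var samples = new double[runs];
        for (var i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            await action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }

        Array.Sort(samples);
        var mid = runs / 2;
        return runs % 2 == 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2.0;
    }
}
```

In the test, the action would wrap the GetAsync call against the search URL, and the assertion would compare the returned median against the 200ms budget.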

The performance bug we found

With 10,000 realistically distributed products, our search endpoint immediately showed the problem: a LIKE '%shirt%' query that worked fine at low volumes forced a table scan and took 4-8 seconds at production scale. A leading wildcard prevents an ordinary B-tree index from narrowing the search, so the fix was a full-text search index on the product name and description columns. We would never have found this in integration tests with 50 records, and our QA environment's 300 records didn't surface it either. It took realistic volume and distribution to reproduce the real user experience.
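
To see why a substring search cannot benefit from a sorted index while a prefix search can, it helps to model the index as a sorted array. The sketch below is only an analogy for how a B-tree behaves, not a description of any particular database engine. IndexIntuition and its methods are hypothetical names.

```csharp
using System;
using System.Linq;

public static class IndexIntuition
{
    // Prefix match on an ordinally sorted array: binary search finds the first
    // candidate, then only the adjacent matching entries are visited.
    public static int CountPrefix(string[] sortedNames, string prefix)
    {
        var lo = 0;
        var hi = sortedNames.Length;
        while (lo < hi) // lower bound: first entry >= prefix
        {
            var mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sortedNames[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }

        var count = 0;
        while (lo + count < sortedNames.Length &&
               sortedNames[lo + count].StartsWith(prefix, StringComparison.Ordinal))
        {
            count++;
        }
        return count;
    }

    // Substring match: sort order does not help, so every entry is inspected.
    public static int CountContains(string[] names, string term) =>
        names.Count(n => n.Contains(term, StringComparison.Ordinal));
}
```

CountPrefix touches roughly log2(n) entries plus the matches, while CountContains always touches all n. With 300 rows the difference is invisible, and with 11,400 rows and longer description text it starts to dominate the response time.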

The generalizable lesson

Performance bugs hide from small test datasets. They live at the intersection of data volume and distribution: they only appear when you have enough realistic data for the query planner to make different decisions, for index selectivity to matter, and for cache warming to change behavior. Generate realistic test data early, run it at the scale your production system will face, and fix the performance problems before your users find them.

Bogus makes this achievable with a few hours of setup. The Fake Data Generator lets you prototype field formats interactively before writing the Bogus rules. A fixed seed (.UseSeed(12345)) produces the same data on every developer's machine and in CI, provided the rules avoid clock-dependent inputs such as DateTime.UtcNow, which makes performance regressions reproducible and debuggable.
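
The interaction between a fixed seed and the system clock is easy to demonstrate. The sketch below uses System.Random as a stand-in for Bogus's seeded generator (an assumption made to keep the example dependency-free). SeedDemo and Timestamps are hypothetical names.

```csharp
using System;
using System.Linq;

public static class SeedDemo
{
    // Generates `count` timestamps within the two years before `referenceUtc`,
    // using a seeded System.Random so the offsets are repeatable.
    public static DateTime[] Timestamps(int seed, DateTime referenceUtc, int count)
    {
        var rng = new Random(seed);
        var windowSeconds = (int)TimeSpan.FromDays(730).TotalSeconds;
        return Enumerable.Range(0, count)
            .Select(_ => referenceUtc.AddSeconds(-rng.Next(windowSeconds)))
            .ToArray();
    }
}
```

Two calls with the same seed and the same reference date return identical timestamps. Passing DateTime.UtcNow as the reference keeps the random offsets identical but shifts every resulting timestamp, which is why a fixed reference date belongs alongside the fixed seed.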
