*Responsibilities*:
- Design, develop, and implement data pipelines for ingesting, pre-processing, and transforming diverse data types (HTML, images, PDFs, audio, video) for generative AI model training and inference.
- Engineer data for vector databases (e.g., Pinecone, Redis, Chroma) and large language models (e.g., GPT-4, Claude 2.0), supporting tasks such as text summarization, entity extraction, and classification (a minimal pipeline sketch appears after this list).
- Build and maintain efficient data storage solutions, including data lakes, warehouses, and databases suited to large-scale generative AI datasets.
- Implement data security and governance policies to safeguard the privacy and integrity of sensitive data used in generative AI projects.
- Collaborate with data scientists and engineers to understand data requirements for generative AI models and translate them into efficient data pipelines.
- Monitor and optimize data pipelines for performance, scalability, and cost-effectiveness.
- Build analytical tools on top of the data pipeline that provide actionable insights into key business performance metrics, including operational efficiency and customer acquisition.
- Collaborate with stakeholders across the executive, product, data, and design teams to support their data infrastructure needs and assist with data-related technical issues.
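
As a rough illustration of the kind of pipeline described above, the sketch below cleans raw HTML, chunks the text, embeds each chunk, and upserts the vectors into a store. It is a minimal sketch, not production code: the `embed` function and the dict-based store are hypothetical stand-ins for a real embedding model and vector database (e.g., Pinecone, Redis, Chroma).

```python
# Minimal, illustrative ingestion pipeline: clean -> chunk -> embed -> upsert.
# `embed` and the dict-based store are hypothetical placeholders.
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document, dropping tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks with overlap -- one common strategy."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(piece: str) -> list[float]:
    """Placeholder embedding: a real pipeline would call an embedding
    model here; a hash-derived vector keeps the sketch self-contained."""
    digest = hashlib.sha256(piece.encode()).digest()
    return [b / 255.0 for b in digest[:8]]


def ingest(html_doc: str, store: dict[str, list[float]]) -> None:
    """Keyed by content hash so re-ingesting the same document is idempotent."""
    for piece in chunk(html_to_text(html_doc)):
        key = hashlib.md5(piece.encode()).hexdigest()
        store[key] = embed(piece)


if __name__ == "__main__":
    store: dict[str, list[float]] = {}
    ingest("<p>Revenue grew 12% in Q3; churn fell to 2.1%.</p>", store)
    print(f"{len(store)} chunk(s) ingested")
```
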
*Qualifications*:
- Bachelor's degree in computer science, data science, statistics, or a related field, or equivalent experience.
- 6+ years of proven experience in data engineering, including ETL, SQL, databases, JSON data, data pipeline development, building data platforms, and data storage technologies.
- 2+ years of experience building and maintaining data pipelines for machine learning projects.
- Strong understanding of data structures, data modeling principles, data quality measures, and data security best practices, with experience transforming, cleaning, and organizing unstructured data.
- High proficiency in Python, SQL, and other scripting languages.
- Experience with continuous integration/continuous deployment (CI/CD) for large data pipelines, and familiarity with containerization technologies (e.g., Docker) and orchestration tools (e.g., Kubernetes) for scalable, efficient model deployment.
- Familiarity with implementing data and/or machine learning algorithms in production systems (e.g., AWS SageMaker, GCP Datalab, or custom implementations).
- Hands-on experience with cloud platforms (e.g., OCI, AWS, GCP, Azure) for data storage and processing, along with generative AI services such as OCI Generative AI, Azure AI Services, or Amazon Bedrock.
- Strong problem-solving skills and the ability to analyze data and design solutions to complex data issues.
- Familiarity with the modern ETL stack (Airflow, dbt, Snowflake), data streaming frameworks (Kafka, Kinesis), vector databases (e.g., Pinecone, Redis, Chroma), and OpenSearch/Elasticsearch (a minimal scheduled-job sketch appears after this list).
- Understanding of large language models (e.g., GPT-4, Claude 2.0) for tasks such as text summarization, entity extraction, and classification.
- Excellent communication skills and the ability to convey complex technical concepts to non-technical stakeholders.
- Ability to work independently and collaboratively in a fast-paced environment.
- Practical knowledge of agile project management and software development methodologies such as Scrum and SAFe.
- Experience working with globally distributed teams.
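
For reference, here is a minimal sketch of the kind of scheduled ETL job the "modern ETL stack" bullet refers to, written with Airflow's TaskFlow API (Airflow 2.4+ assumed). The DAG id, tasks, and records are hypothetical placeholders, not a prescribed design.

```python
# Minimal daily ETL DAG using Airflow's TaskFlow API.
# All names and data below are illustrative placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_etl():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull raw records from a source system or API.
        return [{"id": 1, "text": "  Raw Record  "}]

    @task
    def transform(records: list[dict]) -> list[dict]:
        # Placeholder: normalize fields before loading downstream.
        return [{**r, "text": r["text"].strip().lower()} for r in records]

    @task
    def load(records: list[dict]) -> None:
        # Placeholder: write to a warehouse table (e.g., Snowflake, with
        # dbt typically handling in-warehouse transformations).
        print(f"loading {len(records)} record(s)")

    load(transform(extract()))


example_etl()
```
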